Repository: google/fchan-go
Branch: master
Commit: 9cd80472fe1f
Files: 12
Total size: 101.3 KB

Directory structure:
gitextract_sxcrcuyi/
├── CONTRIBUTING
├── LICENSE
├── README.md
├── src/
│   ├── bench/
│   │   └── bench.go
│   └── fchan/
│       ├── bounded.go
│       ├── fchan_test.go
│       ├── q.go
│       └── unbounded.go
└── writeup/
    ├── graphs.py
    ├── latex.template
    ├── refs.bib
    └── writeup.md

================================================
FILE CONTENTS
================================================

================================================
FILE: CONTRIBUTING
================================================
Want to contribute? Great! First, read this page (including the small print at the end).

### Before you contribute

Before we can use your code, you must sign the [Google Individual Contributor License Agreement](https://cla.developers.google.com/about/google-individual) (CLA), which you can do online. The CLA is necessary mainly because you own the copyright to your changes, even after your contribution becomes part of our codebase, so we need your permission to use and distribute your code. We also need to be sure of various other things—for instance that you'll tell us if you know that your code infringes on other people's patents. You don't have to sign the CLA until after you've submitted your code for review and a member has approved it, but you must do it before we can put your code into our codebase.

Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

### Code reviews

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose.
### The small print Contributions made by corporations are covered by a different agreement than the one above, the [Software Grant and Corporate Contributor License Agreement] (https://cla.developers.google.com/about/google-corporate). ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the 
Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. 
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. 
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================
# `fchan`: Fast Channels in Go

This package contains implementations of fast and scalable channels in Go. The implementation is in `src/fchan`. To run benchmarks, run `src/bench/bench.go`. `bench.go` is very rudimentary, and modifying the source may be necessary depending on what you want to run; that will change in the future. For details on the algorithm, check out the `writeup` directory; it includes a PDF and the Pandoc markdown used to generate it.

**This is a proof of concept only**. This code should *not* be run in production.

Comments, criticisms, and bugs are all welcome!

## Disclaimer

This is not an official Google product.

================================================
FILE: src/bench/bench.go
================================================
// Copyright 2016 Google Inc.
// // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package main import ( "flag" "fmt" "log" "os" "runtime" "runtime/pprof" "sync" "time" "../fchan" ) // chanRec is a wrapper for a channel-like object used in the benchmarking code // to avoid code duplication type chanRec struct { NewHandle func() interface{} Enqueue func(ch interface{}, e fchan.Elt) Dequeue func(ch interface{}) fchan.Elt } func wrapBounded(bound uint64) *chanRec { ch := fchan.NewBounded(bound) return &chanRec{ NewHandle: func() interface{} { return ch.NewHandle() }, Enqueue: func(ch interface{}, e fchan.Elt) { ch.(*fchan.BoundedChan).Enqueue(e) }, Dequeue: func(ch interface{}) fchan.Elt { return ch.(*fchan.BoundedChan).Dequeue() }, } } func wrapUnbounded() *chanRec { ch := fchan.New() return &chanRec{ NewHandle: func() interface{} { return ch.NewHandle() }, Enqueue: func(ch interface{}, e fchan.Elt) { ch.(*fchan.UnboundedChan).Enqueue(e) }, Dequeue: func(ch interface{}) fchan.Elt { return ch.(*fchan.UnboundedChan).Dequeue() }, } } func wrapChan(chanSize int) *chanRec { ch := make(chan fchan.Elt, chanSize) return &chanRec{ NewHandle: func() interface{} { return nil }, Enqueue: func(_ interface{}, e fchan.Elt) { ch <- e }, Dequeue: func(_ interface{}) fchan.Elt { return <-ch }, } } func benchHelp(N int, chanBase *chanRec, nProcs int) time.Duration { const nIters = 1 var totalTime int64 for iter := 0; iter < nIters; iter++ { var waitSetup, waitBench sync.WaitGroup nProcsPer := nProcs / 2 pt := N / nProcsPer 
waitSetup.Add(2*nProcsPer + 1) for i := 0; i < nProcsPer; i++ { waitBench.Add(2) go func() { ch := chanBase.NewHandle() var ( m interface{} = 1 msg fchan.Elt = &m ) waitSetup.Done() waitSetup.Wait() for j := 0; j < pt; j++ { chanBase.Enqueue(ch, msg) } waitBench.Done() }() go func() { ch := chanBase.NewHandle() waitSetup.Done() waitSetup.Wait() for j := 0; j < pt; j++ { chanBase.Dequeue(ch) } waitBench.Done() }() } time.Sleep(time.Millisecond * 5) waitSetup.Done() waitSetup.Wait() start := time.Now().UnixNano() waitBench.Wait() end := time.Now().UnixNano() runtime.GC() time.Sleep(time.Second) totalTime += end - start } return time.Duration(totalTime/nIters) * time.Nanosecond } func render(N, numCPUs int, gmp bool, desc string, t time.Duration) { extra := "" if gmp { extra = "GMP" } fmt.Printf("%s%s-%d\t%d\t%v\n", desc, extra, numCPUs, N, t) } var cpuprofile = flag.String("cpuprofile", "", "write cpu profile `file`") func main() { const ( more = 5000 nOps = 10000000 gmpScale = 1 ) flag.Parse() if *cpuprofile != "" { f, err := os.Create(*cpuprofile) if err != nil { log.Fatal("could not create CPU profile: ", err) } if err := pprof.StartCPUProfile(f); err != nil { log.Fatal("could not start CPU profile: ", err) } defer pprof.StopCPUProfile() } for _, pack := range []struct { desc string f func() *chanRec }{ {"Chan10M", func() *chanRec { return wrapChan(nOps) }}, {"Chan1K", func() *chanRec { return wrapChan(1024) }}, {"Chan0", func() *chanRec { return wrapChan(0) }}, {"Bounded1K", func() *chanRec { return wrapBounded(1024) }}, {"Bounded0", func() *chanRec { return wrapBounded(0) }}, {"Unbounded", wrapUnbounded}, } { for _, nprocs := range []int{2, 4, 8, 12, 16, 24, 28, 32} { runtime.GOMAXPROCS(nprocs) dur := benchHelp(nOps, pack.f(), more) render(nOps, nprocs, false, pack.desc, dur) dur = benchHelp(nOps, pack.f(), nprocs*gmpScale) render(nOps, nprocs, true, pack.desc, dur) } } } ================================================ FILE: src/fchan/bounded.go 
================================================
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package fchan

import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

var (
	s        = 1
	sentinel = unsafe.Pointer(&s)
)

// We use a distinct defined type here (not an alias) because otherwise it
// would be possible for a user to enqueue a value of the same underlying
// type. If that happened, the type assertion in Dequeue could send on one of
// the elements passed in (which is wrong, and could potentially deadlock the
// program as well).
type waitch chan struct{}

func waitChan() waitch { return make(chan struct{}, 2) }

// Possible histories of the values of a cell:
// waitch   ::= channel that a sender waits on when it is over buffer size
// recvChan ::= channel that a receiver waits on when it has to receive a value
//   - nil -> sentinel -> value
//   - nil -> sentinel -> recvChan
//   - nil -> value
//   - nil -> recvChan
// These two may require someone to send on the waitch before transitioning:
//   - nil -> waitch -> value
//   - nil -> waitch -> recvChan

// BoundedChan is a thread-local handle onto a bounded channel.
type BoundedChan struct {
	q          *queue
	head, tail *segment
	bound      uint64
}

// NewBounded allocates a new queue and returns a handle to that queue. Further
// handles are created by calling NewHandle on the result of NewBounded.
func NewBounded(bufsz uint64) *BoundedChan {
	segPtr := &segment{}
	cur := segPtr
	for b := uint64(segSize); b < bufsz; b += segSize {
		cur.Next = &segment{ID: index(b) >> segShift}
		cur = cur.Next
	}
	q := &queue{
		H:           0,
		T:           0,
		SpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))},
	}
	return &BoundedChan{
		q:     q,
		head:  segPtr,
		tail:  segPtr,
		bound: bufsz,
	}
}

// NewHandle creates a new handle for the given queue.
func (b *BoundedChan) NewHandle() *BoundedChan {
	return &BoundedChan{
		q:     b.q,
		head:  b.head,
		tail:  b.tail,
		bound: b.bound,
	}
}

func (b *BoundedChan) adjust() {
	// TODO: factor this out into a helper so that bounded and unbounded can
	// use the same code.
	H := index(atomic.LoadUint64((*uint64)(&b.q.H)))
	T := index(atomic.LoadUint64((*uint64)(&b.q.T)))
	cellH, _ := H.SplitInd()
	advance(&b.head, cellH)
	cellT, _ := T.SplitInd()
	advance(&b.tail, cellT)
}

// tryCas attempts to cas seg.Data[segInd] from sentinel to elt, then (if that
// fails) from nil to elt, then from sentinel to elt once more.
func tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {
	return atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), sentinel, elt) ||
		atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), unsafe.Pointer(nil), elt) ||
		atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), sentinel, elt)
}

// Enqueue sends e on b. If there are already >=bound goroutines blocking, then
// Enqueue will block until sufficiently many elements have been received.
func (b *BoundedChan) Enqueue(e Elt) {
	b.adjust()
	startHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))
	myInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)
	cell, cellInd := myInd.SplitInd()
	seg := b.q.findCell(b.tail, cell)
	if myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {
		// There is a chance that we have to block.
		const patience = 4
		for i := 0; i < patience; i++ {
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), sentinel, unsafe.Pointer(e)) {
				// Between us reading startHead and now, there were enough
				// increments to make it the case that we should no longer
				// block.
				if debug {
					dbgPrint("[enq] swapped out for sentinel\n")
				}
				return
			}
		}
		var w interface{} = makeWeakWaiter(2)
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {
			// We successfully swapped in w. No one will overwrite this
			// location unless they signal w first. We block.
			w.(*weakWaiter).Wait()
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {
				if debug {
					dbgPrint("[enq] blocked then swapped successfully\n")
				}
				return
			}
			// A dequeuer swapped its waiter into this location; we need to
			// use the slow path below.
		} else if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), sentinel, unsafe.Pointer(e)) {
			// Between us reading startHead and now, there were enough
			// increments to make it the case that we should no longer
			// block.
			if debug {
				dbgPrint("[enq] swapped out for sentinel\n")
			}
			return
		}
	} else {
		// Normal case. We know we don't have to block because b.q.H can only
		// increase.
		if tryCas(seg, cellInd, unsafe.Pointer(e)) {
			if debug {
				dbgPrint("[enq] successful tryCas\n")
			}
			return
		}
	}
	for i := 0; ; i++ { // will run at most twice
		if i >= 2 {
			panic("[enq] bug!")
		}
		ptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))
		w := (*waiter)(ptr)
		w.Send(e)
		if debug {
			dbgPrint("[enq] sending to waiter on %v\n", ptr)
		}
		return
	}
}

// Dequeue receives an Elt from b. It blocks if there are no elements enqueued
// there.
func (b *BoundedChan) Dequeue() Elt {
	b.adjust()
	myInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)
	cell, segInd := myInd.SplitInd()
	seg := b.q.findCell(b.head, cell)
	// If there are enqueuers waiting to complete due to the buffer size, we
	// take responsibility for waking up the thread that FA'ed b.q.H + b.bound.
	// If bound is zero, that is just the current thread. Otherwise we have to
	// do some extra work. The thread we are waking up is referred to in names
	// and comments as our 'buddy'.
	var (
		bCell, bInd index
		bSeg        *segment
	)
	if b.bound > 0 {
		buddy := myInd + index(b.bound)
		bCell, bInd = buddy.SplitInd()
		bSeg = b.q.findCell(b.head, bCell)
	}
	w := makeWaiter()
	var res Elt
	if tryCas(seg, segInd, unsafe.Pointer(w)) {
		if debug {
			dbgPrint("[deq] getting res from channel %v\n", w)
		}
		res = w.Recv()
	} else {
		// tryCas failed, which means that, by the "possible histories"
		// argument, this must be either an Elt, a waiter, or a weakWaiter. It
		// cannot be a waiter because we are the only actor allowed to swap
		// one into this location. Thus it must either be a weakWaiter or an
		// Elt. If it is a weakWaiter, then we must signal it before cas'ing
		// in w, otherwise the other thread could starve. If it is a normal
		// Elt we do the rest of the protocol. This also means that we can
		// safely load an Elt from seg, which is not always the case because
		// sentinel is not an Elt.
		//
		// Step 1: We failed to put our waiter into segInd. That means that
		// either our value is in there, or there is a weakWaiter in there.
		// Either way these are valid Elts and we can reliably distinguish
		// them with a type assertion.
		elt := seg.Load(segInd)
		res = elt
		if ww, ok := (*elt).(*weakWaiter); ok {
			ww.Signal()
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), unsafe.Pointer(elt), unsafe.Pointer(w)) {
				if debug {
					dbgPrint("[deq] getting res from channel slow %v\n", w)
				}
				res = w.Recv()
			} else {
				// Someone cas'ed a value over the weakWaiter; it could only
				// have been our friend on the enqueue side.
				if debug {
					dbgPrint("[deq] getting res from load\n")
				}
				res = seg.Load(segInd)
			}
		}
	}
	for i := 0; b.bound > 0; i++ {
		if i >= 2 {
			panic("[deq] bug!")
		}
		// We have successfully gotten the value out of our cell. Now we must
		// ensure that our buddy is either woken up if they are waiting, or
		// that they will know not to sleep.
		//
		// If bElt is not nil, it either has an Elt in it or a weakWaiter. If
		// it has a weakWaiter then we need to signal it to wake up the buddy.
		// If it is nil then we attempt to cas sentinel into the buddy index.
		// If we fail, the buddy may have cas'ed in a weakWaiter, so we must
		// go again. However, that will only happen once.
		bElt := bSeg.Load(bInd) // could this be sentinel? I don't think so..
		if bElt != nil {
			if ww, ok := (*bElt).(*weakWaiter); ok {
				ww.Signal()
			}
			// There is a real queue value in bSeg.Data[bInd], therefore
			// buddy cannot be waiting.
			break
		}
		// Let buddy know that they do not have to block.
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])), unsafe.Pointer(nil), sentinel) {
			break
		}
	}
	return res
}

================================================
FILE: src/fchan/fchan_test.go
================================================
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "reflect" "sync" "testing" ) const perThread = 256 func unorderedEltsEq(s1, s2 []int) bool { readMap := func(i []int) map[int]int { res := make(map[int]int) for _, ii := range i { res[ii]++ } return res } return reflect.DeepEqual(readMap(s1), readMap(s2)) } func TestBoundedQueueElements(t *testing.T) { const numInputs = (1 << 20) bounds := []uint64{0, 1, 1024, segSize} for _, bound := range bounds { var inputs []int var wg sync.WaitGroup for i := 0; i < numInputs; i++ { inputs = append(inputs, i) } h := NewBounded(bound) ch := make(chan int, 1024) for i := 0; i < numInputs/perThread; i++ { wg.Add(1) go func(i int) { hn := h.NewHandle() for j := 0; j < perThread; j++ { var inp interface{} = inputs[i*perThread+j] hn.Enqueue(&inp) } wg.Done() }(i) wg.Add(1) go func() { hn := h.NewHandle() for j := 0; j < perThread; j++ { out := hn.Dequeue() outInt := (*out).(int) ch <- outInt } wg.Done() }() } var outs []int for i := 0; i < numInputs; i++ { outs = append(outs, <-ch) } close(ch) if !unorderedEltsEq(outs, inputs) { t.Errorf("expected %v, got %v", inputs, outs) } wg.Wait() } } func TestQueueElements(t *testing.T) { const numInputs = 1 << 20 iters := numInputs / perThread var inputs []int var wg sync.WaitGroup for i := 0; i < numInputs; i++ { inputs = append(inputs, i) } h := New() ch := make(chan int, 1024) for i := 0; i < iters; i++ { wg.Add(1) go func(i int) { hn := h.NewHandle() for j := 0; j < perThread; j++ { var inp interface{} = inputs[i*perThread+j] hn.Enqueue(&inp) } wg.Done() }(i) wg.Add(1) go func() { hn := h.NewHandle() for j := 
0; j < perThread; j++ { out := hn.Dequeue() ch <- (*out).(int) } wg.Done() }() } var outs []int for i := 0; i < numInputs; i++ { outs = append(outs, <-ch) } close(ch) if !unorderedEltsEq(outs, inputs) { t.Errorf("expected %v, got %v", inputs, outs) } wg.Wait() } func TestSerialQueue(t *testing.T) { const runs = 3*segSize + 1 h := New() var msg interface{} = "hi" for i := 0; i < runs; i++ { var m interface{} = msg h.Enqueue(&m) } for i := 0; i < runs; i++ { p := h.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } } } func TestConcurrentQueueAddFirst(t *testing.T) { const runs = 3*segSize + 1 var wg sync.WaitGroup h := New() var msg interface{} = "hi" t.Logf("Spawning %v adding goroutines", runs) for i := 0; i < runs; i++ { var m interface{} = msg wg.Add(1) go func() { hn := h.NewHandle() hn.Enqueue(&m) wg.Done() }() } t.Logf("Spawning %v getting goroutines", runs) for i := 0; i < runs; i++ { wg.Add(1) go func() { hn := h.NewHandle() p := hn.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } wg.Done() }() } wg.Wait() } func TestConcurrentQueueTakeFirst(t *testing.T) { const runs = 2*segSize + 1 // 4*segSize + 1 var wg sync.WaitGroup h := New() var msg interface{} = "hi" t.Logf("Spawning %v getting goroutines", runs) for i := 0; i < runs; i++ { wg.Add(1) go func() { hn := h.NewHandle() p := hn.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } wg.Done() }() } t.Logf("Spawning %v adding goroutines", runs) for i := 0; i < runs; i++ { var m interface{} = msg wg.Add(1) go func() { hn := h.NewHandle() hn.Enqueue(&m) wg.Done() }() } wg.Wait() } func minN(b *testing.B) int { if b.N < 2 { return 2 } return b.N } ================================================ FILE: src/fchan/q.go ================================================ // Copyright 2016 Google Inc. 
// // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "fmt" "sync" "sync/atomic" "unsafe" ) // basic debug infrastructure const debug = false var dbgPrint = func(s string, i ...interface{}) { fmt.Printf(s, i...) } // Elt is the element type of a queue, can be any pointer type type Elt *interface{} type index uint64 type listElt *segment type waiter struct { E Elt Wgroup sync.WaitGroup } func makeWaiter() *waiter { wait := &waiter{} wait.Wgroup.Add(1) return wait } func (w *waiter) Send(e Elt) { atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e)) w.Wgroup.Done() } func (w *waiter) Recv() Elt { w.Wgroup.Wait() return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)))) } /* type weakWaiter struct { cond *sync.Cond sync.Mutex woke int64 } func makeWeakWaiter(i int32) *weakWaiter { w := &weakWaiter{} w.cond = sync.NewCond(w) return w } func (w *weakWaiter) Signal() { w.Lock() w.woke++ w.cond.Signal() w.Unlock() } func (w *weakWaiter) Wait() { w.Lock() for w.woke == 0 { w.cond.Wait() } w.Unlock() } //*/ /* // Idea to get beyond the scalability bottleneck when number of goroutines is // much larger than gomaxprocs. Have an array of channels with large buffers // (or unbuffered channels?) and group threads into these larger groups. This // means weakWaiters are attached to queue-level state. It has the disadvantage // of making ordering a bit more difficult, as later receivers could wake up // earlier senders. 
I think this is fine, but it merits some thought. type weakWaiter chan struct{} func makeWeakWaiter(i int32) *weakWaiter { var ch weakWaiter = make(chan struct{}, i) return &ch } func (w *weakWaiter) Signal() { *w <- struct{}{} } func (w *weakWaiter) Wait() { <-(*w) } //*/ //* type weakWaiter struct { OSize int32 Size int32 Wgroup sync.WaitGroup } func makeWeakWaiter(i int32) *weakWaiter { wait := &weakWaiter{Size: i, OSize: i} wait.Wgroup.Add(1) return wait } func (w *weakWaiter) Signal() { newVal := atomic.AddInt32(&w.Size, -1) orig := atomic.LoadInt32(&w.OSize) if newVal+1 == orig { w.Wgroup.Done() } } func (w *weakWaiter) Wait() { w.Wgroup.Wait() } // */ // segList is a best-effort data-structure for storing spare segment // allocations. The TryPush and TryPop methods follow standard algorithms for // lock-free linked lists. They have an inconsistent length counter: it may // underestimate the true length of the data-structure, but this allows // threads to bail out early. This is safe because the slow path of allocating // a new segment in Grow still works. type segList struct { MaxSpares int64 Length int64 Head *segLink } // segLink is a list element in a segList. Note that we cannot just re-use the // segment Next pointers without modifying the algorithm, as TryPush could // potentially sever pointers in the live queue data structure. That would // break everything. type segLink struct { Elt listElt Next *segLink } func (s *segList) TryPush(e listElt) { // bail out if list is at capacity if atomic.LoadInt64(&s.Length) >= s.MaxSpares { return } // add to length. Note that this is not atomic with respect to the append, // which means we may be under capacity on occasion. This list is only used // in a best-effort capacity, so that is okay.
atomic.AddInt64(&s.Length, 1) if debug { dbgPrint("Length now %v\n", s.Length) } tl := &segLink{ Elt: e, Next: nil, } const patience = 4 i := 0 for ; i < patience; i++ { // attempt to CAS Head from nil to tl if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)), unsafe.Pointer(nil), unsafe.Pointer(tl)) { break } // try to find an empty element tailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)))) if tailPtr == nil { // if Head was switched to nil, retry continue } // advance tailPtr until it has a nil next pointer for { next := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)))) if next == nil { break } tailPtr = next } // try to add something to the end of the list if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)), unsafe.Pointer(nil), unsafe.Pointer(tl)) { break } } if i == patience { atomic.AddInt64(&s.Length, -1) } if debug && i != patience { dbgPrint("Successfully pushed to segment list\n") } }
See the comments in // TryPush() if atomic.LoadInt64(&s.Length) <= 0 { return nil, false } for i := 0; i < patience; i++ { hd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)))) if hd == nil { return nil, false } // if head is not nil, try to swap it for its next pointer nxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next)))) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)), unsafe.Pointer(hd), unsafe.Pointer(nxt)) { if debug { dbgPrint("Successfully popped off segment list\n") } atomic.AddInt64(&s.Length, -1) return hd.Elt, true } } return nil, false } // segment size const segShift = 12 const segSize = 1 << segShift // The channel buffer is stored as a linked list of fixed-size arrays of size // segsize. ID is a monotonically increasing identifier corresponding to the // index in the buffer of the first element of the segment, divided by segSize // (see SplitInd). type segment struct { ID index Next *segment Data [segSize]Elt } // Load atomically loads the element at index i of s func (s *segment) Load(i index) Elt { return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i])))) } // Queue is the global state of the channel. It contains indices into the head // and tail of the channel as well as a linked list of spare segments used to // avoid excess allocations. type queue struct { H index // head index T index // tail index SpareAllocs segList } // SplitInd splits i into the ID of the segment to which it refers as well as // the local index into that segment func (i index) SplitInd() (cellNum index, cellInd index) { cellNum = (i >> segShift) cellInd = i - (cellNum * segSize) return } const spare = true // grow is called if a thread has arrived at the end of the segment list but // needs to enqueue/dequeue from an index with a higher cell ID. In this case we // attempt to assign the segment's next pointer to a new segment. 
Allocating // segments can be expensive, so the underlying queue has a 'SpareAllocs' list of // segments that can be used to grow the queue, or to store unused segments that the // thread allocates. The presence of 'SpareAllocs' complicates the protocol quite // a bit, but it is wait-free (aside from memory allocation) and it will only // return if tail.Next is non-nil. func (q *queue) Grow(tail *segment) { curTail := atomic.LoadUint64((*uint64)(&tail.ID)) if spare { if next, ok := q.SpareAllocs.TryPop(); ok { atomic.StoreUint64((*uint64)(&next.ID), curTail+1) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)), unsafe.Pointer(nil), unsafe.Pointer(next)) { return } } } newSegment := &segment{ID: index(curTail + 1)} if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)), unsafe.Pointer(nil), unsafe.Pointer(newSegment)) { if debug { dbgPrint("\t\tgrew\n") } return } if spare { // If we allocated a new segment but failed, attempt to place it in // SpareAllocs so someone else can use it. q.SpareAllocs.TryPush(newSegment) } } // advance will search for a segment with ID cell at or after the segment in // ptr. It returns with ptr either pointing to the cell in question or to the // last non-nil segment in the list. func advance(ptr **segment, cell index) { for { next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&(*ptr).Next)))) if next == nil || next.ID > cell { break } *ptr = next } } ================================================ FILE: src/fchan/unbounded.go ================================================ // Copyright 2016 Google Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "runtime" "sync/atomic" "unsafe" ) // Thread-local state for interacting with an unbounded channel type UnboundedChan struct { // pointer to global state q *queue // pointer into last guess at the true head and tail segments head, tail *segment } // New initializes a new queue and returns an initial handle to that queue. All // other handles are allocated by calls to NewHandle() func New() *UnboundedChan { segPtr := &segment{} // 0 values are fine here q := &queue{ H: 0, T: 0, SpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))}, } h := &UnboundedChan{ q: q, head: segPtr, tail: segPtr, } return h } // NewHandle creates a new handle for the given Queue. func (u *UnboundedChan) NewHandle() *UnboundedChan { return &UnboundedChan{ q: u.q, head: u.head, tail: u.tail, } } // Enqueue enqueues an Elt into the channel // TODO(ezrosent): enforce that e is not nil; I think we make that assumption // here. func (u *UnboundedChan) Enqueue(e Elt) { u.adjust() // don't always do this? myInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1) cell, cellInd := myInd.SplitInd() seg := u.q.findCell(u.tail, cell) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(e)) { return } wt := (*waiter)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))) wt.Send(e) } // findCell finds a segment at or after start with ID cellID. If one does not // yet exist, it grows the list of segments.
func (q *queue) findCell(start *segment, cellID index) *segment { cur := start for cur.ID != cellID { next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next)))) if next == nil { q.Grow(cur) continue } cur = next } return cur } // adjust moves u's head and tail pointers forward if H and T point to a newer // segment. The loads and moves do not need to be atomic because H and T only // ever increase in value. Calling this regularly is probably good for // performance, and is necessary to ensure that old segments are garbage // collected. func (u *UnboundedChan) adjust() { H := index(atomic.LoadUint64((*uint64)(&u.q.H))) T := index(atomic.LoadUint64((*uint64)(&u.q.T))) cellH, _ := H.SplitInd() advance(&u.head, cellH) cellT, _ := T.SplitInd() advance(&u.tail, cellT) } // Dequeue dequeues an element from the channel; it will block if nothing is there func (u *UnboundedChan) Dequeue() Elt { u.adjust() myInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1) cell, cellInd := myInd.SplitInd() seg := u.q.findCell(u.head, cell) elt := seg.Load(cellInd) wt := makeWaiter() if elt == nil && atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(wt)) { if debug { dbgPrint("\t[deq] slow path\n") } return wt.Recv() } return seg.Load(cellInd) } ================================================ FILE: writeup/graphs.py ================================================ # Copyright 2016 Google Inc. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # limitations under the License. """ This is a basic script that parses the output of fchan_main and renders the graphs for goroutines=GOMAXPROCS and goroutines=5000 """ import numpy as np import matplotlib.pyplot as plt import seaborn as unused_import import re import sys class BenchResult(object): def __init__(self, name, gmp, max_hw_thr, nops, secs): self.name = name self.gmp = gmp self.max_hw_thr = int(max_hw_thr) # Millions of operations per second self.tp = float(nops) / (float(secs) * 1e6) def parse_line(line): m_gmp = re.match(r'^([^\-]*)GMP-(\d+)\s+(\d+)\s+([^\s]*)s\s*$', line) m2 = re.match(r'^([^\-]*)-(\d+)\s+(\d+)\s+([^\s]*)s\s*$', line) if m_gmp is not None: name, threads, nops, secs = m_gmp.groups() return BenchResult(name, True, threads, nops, secs) if m2 is not None: name, threads, nops, secs = m2.groups() return BenchResult(name, False, threads, nops, secs) print line, 'did not match anything' return None def plot_points(all_results, gmp): series = sorted(list({k.name for k in all_results if k.gmp == gmp})) for k in series: results = [r for r in all_results if r.gmp == gmp and r.name == k] points = sorted((r.max_hw_thr, r.tp) for r in results) plt.xlabel(r'GOMAXPROCS') plt.ylabel('Ops / second (millions)') X = np.array([x for (x, y) in points]) Y = np.array([y for (x, y) in points]) plt.plot(X, Y, label=k) plt.scatter(X, Y) plt.legend() def main(fname): with open(fname) as f: results = [p for p in (parse_line(line) for line in f) if p is not None] print 'Generating non-GMP graph' plt.title('5000 Goroutines') plot_points(results, False) plt.savefig('contend_graph.pdf') plt.clf() print 'Generating GMP graph' plt.title('Goroutines Equal to GOMAXPROCS') plot_points(results, True) plt.savefig('gmp_graph.pdf') plt.clf() if __name__ == '__main__': main(sys.argv[1]) ================================================ FILE: writeup/latex.template 
================================================ \documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$babel-lang$,$endif$$if(papersize)$$papersize$,$endif$$for(classoption)$$classoption$$sep$,$endfor$]{$documentclass$} $if(fontfamily)$ \usepackage{$fontfamily$} $else$ %\usepackage{lmodern} $endif$ $if(linestretch)$ \usepackage{setspace} \setstretch{$linestretch$} $endif$ \usepackage{amssymb,amsmath} \usepackage{ifxetex,ifluatex} \usepackage{fixltx2e} % provides \textsubscript \ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} $if(euro)$ \usepackage{eurosym} $endif$ \else % if luatex or xelatex \ifxetex \usepackage{mathspec} \usepackage{xltxtra,xunicode} \else \usepackage{fontspec} \fi \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase} \newcommand{\euro}{€} $if(mainfont)$ \setmainfont{$mainfont$} $endif$ $if(sansfont)$ \setsansfont{$sansfont$} $endif$ $if(monofont)$ \setmonofont[Mapping=tex-ansi]{$monofont$} $endif$ $if(mathfont)$ \setmathfont(Digits,Latin,Greek){$mathfont$} $endif$ $if(CJKmainfont)$ \usepackage{xeCJK} \setCJKmainfont[$CJKoptions$]{$CJKmainfont$} $endif$ \fi % use upquote if available, for straight quotes in verbatim environments \IfFileExists{upquote.sty}{\usepackage{upquote}}{} % use microtype if available \IfFileExists{microtype.sty}{% \usepackage{microtype} \UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts }{} $if(geometry)$ \usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry} $endif$ \ifxetex \usepackage[setpagesize=false, % page size defined by xetex unicode=false, % unicode breaks when used with xetex xetex]{hyperref} \else \usepackage[unicode=true]{hyperref} \fi \usepackage[usenames,dvipsnames]{color} \hypersetup{breaklinks=true, bookmarks=true, pdfauthor={$author-meta$}, pdftitle={$title-meta$}, colorlinks=true, citecolor=$if(citecolor)$$citecolor$$else$blue$endif$, urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$, 
linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$, pdfborder={0 0 0}} \urlstyle{same} % don't use monospace font for urls $if(lang)$ \ifxetex \usepackage{polyglossia} \setmainlanguage[variant=$polyglossia-variant$]{$polyglossia-lang$} \setotherlanguages{$for(polyglossia-otherlangs)$$polyglossia-otherlangs$$sep$,$endfor$} \else \usepackage[shorthands=off,$babel-lang$]{babel} \fi $endif$ $if(natbib)$ \usepackage{natbib} \bibliographystyle{$if(biblio-style)$$biblio-style$$else$plainnat$endif$} $endif$ $if(biblatex)$ \usepackage{biblatex} $for(bibliography)$ \addbibresource{$bibliography$} $endfor$ $endif$ $if(listings)$ \usepackage{listings} $endif$ $if(lhs)$ \lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{} $endif$ $if(highlighting-macros)$ $highlighting-macros$ $endif$ $if(verbatim-in-note)$ \usepackage{fancyvrb} \VerbatimFootnotes $endif$ $if(tables)$ \usepackage{longtable,booktabs} $endif$ $if(graphics)$ \usepackage{graphicx,grffile} \makeatletter \def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi} \def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi} \makeatother % Scale images if necessary, so that they will not overflow the page % margins by default, and it is still possible to overwrite the defaults % using explicit options in \includegraphics[width, height, ...]{} \setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio} $endif$ $if(links-as-notes)$ % Make links footnotes instead of hotlinks: \renewcommand{\href}[2]{#2\footnote{\url{#1}}} $endif$ $if(strikeout)$ \usepackage[normalem]{ulem} % avoid problems with \sout in headers with hyperref: \pdfstringdefDisableCommands{\renewcommand{\sout}{}} $endif$ \setlength{\parindent}{0pt} \setlength{\parskip}{6pt plus 2pt minus 1pt} \setlength{\emergencystretch}{3em} % prevent overfull lines \providecommand{\tightlist}{% \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}} $if(numbersections)$ 
\setcounter{secnumdepth}{5} $else$ \setcounter{secnumdepth}{0} $endif$ $if(verbatim-in-note)$ \VerbatimFootnotes % allows verbatim text in footnotes $endif$ $if(dir)$ \ifxetex % load bidi as late as possible as it modifies e.g. graphicx $if(latex-dir-rtl)$ \usepackage[RTLdocument]{bidi} $else$ \usepackage{bidi} $endif$ \fi \ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex \TeXXeTstate=1 \newcommand{\RL}[1]{\beginR #1\endR} \newcommand{\LR}[1]{\beginL #1\endL} \newenvironment{RTL}{\beginR}{\endR} \newenvironment{LTR}{\beginL}{\endL} \fi $endif$ $if(title)$ \title{$title$$if(subtitle)$\\\vspace{0.5em}{\large $subtitle$}$endif$} $endif$ $if(author)$ \usepackage{fancyhdr} \fancypagestyle{plain}{} \pagestyle{fancy} \fancyhead[LO,RE]{\large $for(author)$$author.name$$sep$ \and $endfor$} %\author{$for(author)$$author.name$$sep$ \and $endfor$} $endif$ \date{$date$} $for(header-includes)$ $header-includes$ $endfor$ % Redefines (sub)paragraphs to behave more like sections \ifx\paragraph\undefined\else \let\oldparagraph\paragraph \renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}} \fi \ifx\subparagraph\undefined\else \let\oldsubparagraph\subparagraph \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}} \fi \begin{document} $if(title)$ \maketitle $endif$ $if(abstract)$ \begin{abstract} $abstract$ \end{abstract} $endif$ $for(include-before)$ $include-before$ $endfor$ $if(toc)$ { \vspace{-0.9in} \hypersetup{linkcolor=$if(toccolor)$$toccolor$$else$black$endif$} \setcounter{tocdepth}{$toc-depth$} \tableofcontents } $endif$ $if(lot)$ \listoftables $endif$ $if(lof)$ \listoffigures $endif$ $body$ $if(natbib)$ $if(bibliography)$ $if(biblio-title)$ $if(book-class)$ \renewcommand\bibname{$biblio-title$} $else$ \renewcommand\refname{$biblio-title$} $endif$ $endif$ \bibliography{$for(bibliography)$$bibliography$$sep$,$endfor$} $endif$ $endif$ $if(biblatex)$ \printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$ $endif$ $for(include-after)$ $include-after$ $endfor$ 
\end{document} ================================================ FILE: writeup/refs.bib ================================================ @inproceedings{wfq, title={A wait-free queue as fast as fetch-and-add}, author={Yang, Chaoran and Mellor-Crummey, John}, booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, pages={16}, year={2016}, organization={ACM} } @inproceedings{lcrq, title={Fast concurrent queues for x86 processors}, author={Morrison, Adam and Afek, Yehuda}, booktitle={ACM SIGPLAN Notices}, volume={48}, number={8}, pages={103--112}, year={2013}, organization={ACM} } @incollection{CSP, title={Communicating sequential processes}, author={Hoare, Charles Antony Richard}, booktitle={The origin of concurrent programming}, pages={413--443}, year={1978}, publisher={Springer} } @book{tgpl, author = {Donovan, Alan A.A. and Kernighan, Brian W.}, title = {The Go Programming Language}, year = {2015}, isbn = {0134190440, 9780134190440}, edition = {1st}, publisher = {Addison-Wesley Professional}, } @inproceedings{MSQueue, title={Simple, fast, and practical non-blocking and blocking concurrent queue algorithms}, author={Michael, Maged M and Scott, Michael L}, booktitle={Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing}, pages={267--275}, year={1996}, organization={ACM} } @article{herlihyBook, title={The Art of Multiprocessor Programming}, author={Herlihy, Maurice and Shavit, Nir}, year={2008}, publisher={Morgan Kaufmann Publishers Inc.} } @online{GoSpec, title = {The Go Programming Language Specification}, howpublished = {\url{https://golang.org/ref/spec}}, year = {2009}, urldate = {2016-10-30} } @inproceedings{FastSlow, title={A methodology for creating fast wait-free data structures}, author={Kogan, Alex and Petrank, Erez}, booktitle={ACM SIGPLAN Notices}, volume={47}, number={8}, pages={141--150}, year={2012}, organization={ACM} } @article{wfsync, title={Wait-free 
synchronization}, author={Herlihy, Maurice}, journal={ACM Transactions on Programming Languages and Systems (TOPLAS)}, volume={13}, number={1}, pages={124--149}, year={1991}, publisher={ACM} } @incollection{marlowPar, title={Parallel and concurrent programming in Haskell}, author={Marlow, Simon}, booktitle={Central European Functional Programming School}, pages={339--401}, year={2012}, publisher={Springer} } @article{herlihyLinear, title={Linearizability: A correctness condition for concurrent objects}, author={Herlihy, Maurice P and Wing, Jeannette M}, journal={ACM Transactions on Programming Languages and Systems (TOPLAS)}, volume={12}, number={3}, pages={463--492}, year={1990}, publisher={ACM} } ================================================ FILE: writeup/writeup.md ================================================
---
title: Faster Channels in Go (Work in Progress)
subtitle: Scaling Blocking Channels with Techniques from Nonblocking Data-Structures.
toc: true
link-citations: true
geometry: ['margin=1in']
fontsize: 11pt
author:
  name: Eli Rosenthal
---

# Introduction

Channels in the [Go](https://golang.org/) language are a common way to structure concurrent code. The channel API in Go is intended to support programming in the manner described by CSP [see @CSP, the original paper; also the preface of @tgpl for CSP's relationship to Go]. Channels in Go have a fixed buffer size $b$ such that only $b$ senders may return without having handed a value off to a corresponding receiver. Here is some basic pseudocode[^pseudo] for the send and receive operations[^select], though it is worth referring to the spec @GoSpec as well.
~~~~
send(c: chan T, item: T)
  atomically do:
    if the buffer is full
      block
    append an item to the buffer
    if there were any receivers blocked
      wake the first one up

receive(c: chan T) -> T
  atomically do:
  begin:
    if there are items in the buffer
      result = head of buffer
      advance the buffer head
      if there are any senders waiting
        wake the first sender up
      return result
    if the buffer is empty
      block
      goto begin
~~~~

Go channels currently require goroutines[^goroutine] to acquire a single lock before performing additional operations[^chanimp]. This makes contention for this lock a scalability bottleneck: while acquiring a mutex can be very fast, it means that only one thread can perform an operation on a queue at a time. This document describes the implementation of a novel channel algorithm that permits different sends and receives to complete in parallel. We will start with a review of recent literature on non-blocking queues. Then we will move on to describing the implementation of a fast *unbounded* channel in Go; this algorithm may be of independent interest. Finally, we will extend this design to provide the bounded semantics of Go channels. We will also report performance measurements for these algorithms.

# Non-blocking Queues

The standard data-structure closest to the notion of an unbounded channel is that of a FIFO queue. A queue supports enqueue and dequeue operations, where it is common for dequeue to be allowed to fail if there are no elements in the queue. There are myriad algorithms for concurrent queues which provide different guarantees in terms of progress and consistency [see @herlihyBook Chapter 10 for an overview], but we will focus here on *non-blocking* queues because of that literature's approach to making scalable concurrent data-structures. Informally, we say a data-structure is non-blocking if no thread can perform an operation that will require it to block any other threads for an unbounded amount of time.
As a result, no queue that requires a thread to take a lock can be non-blocking: one thread can acquire a lock and then be de-scheduled for an arbitrary amount of time, thereby blocking all other threads contending for the lock. Non-blocking algorithms generally use atomic instructions like Compare-And-Swap (CAS) to avoid different threads stepping on one another's toes [see @herlihyBook Chapter 3 for a tutorial on atomic synchronization primitives]. Non-blocking operations can exhibit a number of additional progress guarantees:

* **Obstruction Freedom** If there is only one thread executing an operation, that operation will complete in a finite number of steps.

* **Lock Freedom** Regardless of the number of threads executing an operation concurrently, at least one thread will complete the operation in a finite number of steps.

* **Wait Freedom** Any thread executing an operation is guaranteed to finish in a finite number of steps.

Non-blocking synchronization is not a panacea. The fact that there are hard upper bounds on how long it will take for a thread to complete an operation does not imply that the algorithm will perform better in practice. While wait-free data-structures are important for some embedded or real-time systems that need these strong guarantees, there are often blocking algorithms which perform better in terms of throughput than their lock-free or wait-free counterparts[^combine]. Still, non-blocking algorithms can shine in high-contention settings. A small number of CAS operations can amount to less overhead than acquiring a lock, and more fine-grained concurrency coupled with progress guarantees *can* reduce contention[^msqueue].

# Using Fetch-and-Add to Reduce Contention

The atomic Fetch-and-Add (F&A) instruction adds a value to an integer and returns the old or new value of that integer.
Here are the basic semantics of the operation in Go[^fasem]:

```go
// atomically
func AtomicFetchAdd(src *int, delta int) int {
	*src += delta
	return *src
}
```

While hardware support for an F&A instruction is not as universal as that of CAS, F&A is implemented on x86. On modern x86 machines, F&A is much faster than CAS [see @lcrq for performance measurements], and it always succeeds. This has the dual effect of allowing code making judicious use of F&A to be both efficient and easier to reason about than equivalents that rely only on CAS. A common pattern exemplifying this idea is to first use F&A to acquire an index into an array, and then to use more conventional techniques to write to that index. This is helpful because it can reduce contention on individual locations for a data-structure.

## A Non-blocking Queue From an Infinite Array

To illustrate this, we will write two non-blocking queues in pseudo-Go based on an infinite array [`Queue2` is based on the obstruction-free queue presented in pseudo-code in @wfq, `Queue1` is a CAS-ification of that design]. Both of these designs make use of the fact that head and tail pointers *only ever increase*.

~~~~ {.go .numberLines }
type Queue1 struct {
	head, tail *T
	data       [∞]T
}
func (q *Queue1) Enqueue(elt T) {
	for {
		newTail := atomic.LoadPointer(&q.tail) + 1
		if atomic.CompareAndSwapT(newTail, nil, elt) {
			atomic.CompareAndSwap(&q.tail, q.tail, newTail)
			break
		}
	}
}
func (q *Queue1) Dequeue() T {
	for {
		curHead := atomic.LoadPointer(&q.head)
		curTail := atomic.LoadPointer(&q.tail)
		if curHead == curTail {
			return nil
		}
		if atomic.CompareAndSwapPointer(&q.head, curHead, curHead+1) {
			return *curHead
		}
	}
}
~~~~

The second queue will assume that the type `T` can not only take on a `nil` value but also an unambiguous `SENTINEL` value that a user is guaranteed not to pass in to `Enqueue`. This value is used to mark an index as unusable, signalling a conflicting `Enqueue` thread that it should try again.
~~~~ {.go .numberLines startFrom="26"}
type Queue2 struct {
	head, tail uint
	data       [∞]T
}

func (q *Queue2) Enqueue(elt T) {
	for {
		myTail := atomic.AddUint(&q.tail, 1) - 1
		if atomic.CompareAndSwapT(&q.data[myTail], nil, elt) {
			break
		}
	}
}

func (q *Queue2) Dequeue() T {
	for {
		myHead := atomic.AddUint(&q.head, 1) - 1
		curTail := atomic.LoadUint(&q.tail)
		if !atomic.CompareAndSwapPointer(&q.data[myHead], nil, SENTINEL) {
			return atomic.LoadT(&q.data[myHead])
		}
		if myHead == curTail {
			return nil
		}
	}
}
~~~~

The core algorithm for both `Queue1` and `Queue2` is essentially the same. Enqueueing threads load a view of the tail pointer and try to CAS their element in one element after that pointer; dequeueing threads perform a symmetric operation to advance the head pointer. The practical (that is, practical for algorithms that require an infinite amount of memory) difference between `Queue1` and `Queue2` is that `Queue2` first has threads perform an atomic increment of a head or tail index. This means that two concurrent enqueue operations will always attempt a CAS on *different* queue elements. As a result, enqueue operations need only concern themselves with dequeue operations that increment `head` to the same value as their `myTail` (lines 33--34). A downside of this approach is that while `Queue1` is lock free, `Queue2` is merely obstruction free. For an enqueue/dequeue pair of threads, each can continually increment equal `head` and `tail` indices while the dequeuer's CAS (line 44) always succeeds before the enqueuer's (line 34), resulting in livelock[^livelockdef].

## Lessons for Channels

The `Queue2` above is the core of the implementation of a fast wait-free queue in @wfq. It is also the basic idea that we will leverage when designing a more scalable channel. The rest of their algorithm consists in solving three problems that have analogs in our setting.
(1) *Simulating an infinite array with a finite amount of memory.* Here the authors implement a linked list of fixed-length arrays (called segments, or cells); threads grow this array when more space is required. (2) *Going from obstruction freedom to wait freedom.* This involves attempting either `Dequeue` or `Enqueue` above for a constant number of iterations, followed by a slow path which involves implementing a helping mechanism[^helping] to help contending threads to finish their outstanding operations. (3) *Memory Reclamation.* Reclaiming memory in a non-blocking setting is, perhaps unsurprisingly, a very fraught task. While the solution to (3) in this paper is interesting and efficient, we will (mercifully) be relying on Go's garbage collection mechanism to solve this problem. For (1) we will employ essentially the same algorithm as the paper, but with additional optimizations for memory allocation. For (2) our slow path will implement the blocking semantics of a channel. # An Unbounded Channel With Low Contention We first consider the case of implementing an unbounded channel. While this channel is blocking --- Go channels must in some capacity be blocking as they provide a synchronization mechanism --- it only blocks when it has to (i.e. for receives that do not yet have a corresponding send), and when it does progress is impeded for at most 2 threads, the components of a send/receive pair. We will start with the types: ~~~ {.go } type Elt *interface{} type index uint64 // segment size const segShift = 12 const segSize = 1 << segShift // The channel buffer is stored as a linked list of fixed-size arrays of size // segsize. ID is a monotonically increasing identifier corresponding to the // index in the buffer of the first element of the segment, divided by segSize // (see SplitInd). 
type segment struct { ID index // index of Data[0] / segSize Next *segment Data [segSize]Elt } // Load atomically loads the element at index i of s func (s *segment) Load(i index) Elt { return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i])))) } // Queue is the global state of the channel. It contains indices into the head // and tail of the channel as well as a linked list of spare segments used to // avoid excess allocations. type queue struct{ H, T index } // Thread-local state for interacting with an unbounded channel type UnboundedChan struct { // pointer to global state q *queue // pointer into last guess at the true head and tail segments head, tail *segment } ~~~~ The only data-structure-global global state that we employ is the `queue` structure which maintains the head and tail indices. Pointers into the data itself are kept locally in an `UnboundedChan` for two reasons (1) It reduces any possible contention resulting from updated shared head or tail pointers. (2) If individual threads all update local head and tail pointers, then the garbage collector will be able to clean up used segments when (and only when) all threads no longer hold a reference to them. We note that a downside of this design is that inactive threads that hold such a handle can cause space leaks by holding onto references to long-dead segments. Users interact with a channel by first creating an initial value, and later cloning that value and others derived from it using `NewHandle`. ~~~~ {.go } // New initializes a new queue and returns an initial handle to that queue. 
// All other handles are allocated by calls to NewHandle().
func New() *UnboundedChan {
    segPtr := &segment{} // 0 values are fine here
    q := &queue{H: 0, T: 0}
    h := &UnboundedChan{q: q, head: segPtr, tail: segPtr}
    return h
}

// NewHandle creates a new handle for the given channel
func (u *UnboundedChan) NewHandle() *UnboundedChan {
    return &UnboundedChan{q: u.q, head: u.head, tail: u.tail}
}
~~~~

## Sending and Receiving

The core enqueue (or send) algorithm is to atomically increment the `T` index, attempt to CAS in the item, and to wake up a blocked thread if the CAS fails. We will begin with the `Enqueue` code and then explain the code that it calls.

~~~~ {.go .numberLines}
func (u *UnboundedChan) Enqueue(e Elt) {
    u.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := u.q.findCell(u.tail, cell)
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
        unsafe.Pointer(nil), unsafe.Pointer(e)) {
        return
    }
    wt := (*waiter)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd]))))
    wt.Send(e)
}

func (u *UnboundedChan) Dequeue() Elt {
    u.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := u.q.findCell(u.head, cell)
    elt := seg.Load(cellInd)
    wt := makeWaiter()
    if elt == nil && atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
        unsafe.Pointer(nil), unsafe.Pointer(wt)) {
        return wt.Recv()
    }
    return seg.Load(cellInd)
}
~~~~

The `adjust` method (line 2) atomically loads `H` and `T`, then advances `u.head` and `u.tail` to point to their cells. The `atomic.AddUint64` on line 3 acquires an index into the queue. `SplitInd` (line 4) returns the cell ID and the index into that cell corresponding to `myInd`. As `T` can only increase, the only possible thread that could also be contending for this item is a `Dequeue`ing thread that acquired `H` as the same value as `myInd`.
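As an aside, `SplitInd` can be sketched as a shift and a mask over the constants defined earlier; this is an illustration, and the repository's actual implementation may differ in detail:

```go
// Constants and the index type repeated from above, for completeness.
const segShift = 12
const segSize = 1 << segShift

type index uint64

// SplitInd splits a logical queue index into a segment (cell) ID and an
// offset into that segment's Data array: the high bits select the segment,
// the low bits select the slot within it.
func (i index) SplitInd() (cell, cellInd index) {
    cell = i >> segShift        // i / segSize
    cellInd = i & (segSize - 1) // i % segSize
    return
}
```

For example, logical index $4101 = 1 \cdot 4096 + 5$ splits into cell 1, slot 5.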
So it comes down to the CASes on lines 6 and 21--22. If the first CAS fails, it means a `Dequeue` thread has swapped in a `waiter`; if it succeeds, an `Enqueue`r can return and a contending `Dequeue`r can simply load the value at `cellInd`.

## Blocking

So what is a `waiter`? It acts like a channel with buffer size 1, or an *MVar* in the Haskell community [see Chapter 7 of @marlowPar for an introduction], that can only tolerate one element being sent on it. We currently implement this in terms of a single value and a `WaitGroup`. `WaitGroup`s in Go's `sync` package allow goroutines to `Add` an integer value to the `WaitGroup`'s counter and to `Wait` for that counter to reach zero. If the counter goes below zero, the current `WaitGroup` implementation panics, which is helpful for debugging purposes, as there should only ever be one `Send` or `Recv` on a `waiter` here.

~~~~ {.go}
type waiter struct {
    E      Elt
    Wgroup sync.WaitGroup
}

func makeWaiter() *waiter {
    wait := &waiter{}
    wait.Wgroup.Add(1)
    return wait
}

func (w *waiter) Send(e Elt) {
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))
    w.Wgroup.Done() // The Done method just calls Add(-1)
}

func (w *waiter) Recv() Elt {
    w.Wgroup.Wait()
    return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))
}
~~~~

There are two important parts of our strategy to implement blocking. First, neither Enqueuers nor Dequeuers will block at all if Enqueuers complete before Dequeuers begin. In fact the only global synchronization they must perform is a single F&A and a single *uncontended* CAS (unless they must grow the queue; see below). Second, if an Enqueuer does not arrive soon enough and a Dequeuer must block on a `waiter`, there will be essentially no contention for the waiter, because only one other thread can ever interact with it.

## Growing the Queue and Allocation

We will now describe the implementation of the `findCell` method.
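Before doing so, here is a compact, self-contained sanity check of the `waiter` rendezvous. The definitions are condensed from the snippet above, and `rendezvous` is purely an illustrative harness, not part of the channel:

```go
import (
    "sync"
    "sync/atomic"
    "unsafe"
)

type Elt *interface{}

// waiter, makeWaiter, Send, and Recv, condensed from above.
type waiter struct {
    E      Elt
    Wgroup sync.WaitGroup
}

func makeWaiter() *waiter {
    w := &waiter{}
    w.Wgroup.Add(1)
    return w
}

func (w *waiter) Send(e Elt) {
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))
    w.Wgroup.Done()
}

func (w *waiter) Recv() Elt {
    w.Wgroup.Wait()
    return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))
}

// rendezvous parks a receiver on a waiter until a sender completes the pair.
func rendezvous() interface{} {
    w := makeWaiter()
    go func() {
        var v interface{} = "hello"
        w.Send(Elt(&v)) // releases the Recv below
    }()
    return *w.Recv() // blocks until Send has run
}
```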
The algorithm is to start at a given segment pointer and to follow that segment's `Next` pointer until that segment's `ID` is equal to a given `cell` index. If `findCell` reaches the end of the list of segments before it reaches the correct index, it attempts to allocate a new segment and place it at the end of the list. Here is some code:

~~~~ {.go .numberLines}
func (q *queue) findCell(start *segment, cellID index) *segment {
    cur := start
    for cur.ID != cellID {
        next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next))))
        if next == nil {
            q.Grow(cur)
            continue
        }
        cur = next
    }
    return cur
}

func (q *queue) Grow(tail *segment) {
    curTail := atomic.LoadUint64((*uint64)(&tail.ID))
    newSegment := &segment{ID: index(curTail + 1)}
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
        unsafe.Pointer(nil), unsafe.Pointer(newSegment)) {
        return
    }
}
~~~~

Note that we can get away with performing a single CAS operation in `Grow`: if our CAS failed, we know someone else succeeded, and a new segment with an ID of `tail.ID+1` is the only possible value that could be placed there. However, there *is* a problem with this implementation: it is extremely wasteful. In a high-contention situation, it is possible for many threads to each allocate a new segment when only one of them will succeed. Any failed allocations become immediately unreachable and will hence be garbage collected. In our experiments, channel operations are fastest when segments have a size of $\geq 1024$, so any wasted allocation can have a tangible impact on throughput. This slowdown was evident in our performance measurements.

Our solution to this problem is to keep a lock-free linked list of spare segments in the `queue` structure. Threads in `Grow` first try to pop a segment off of this list, and then perform the CAS; only if the pop fails do they allocate a new segment. Symmetrically, if the CAS fails, threads attempt to push their segment onto this list.
The list keeps a best-effort counter representing its length and does not allow this counter to grow past a maximum length; this allows us to avoid a space leak in the implementation of the queue. For a full implementation of `Grow`, see Appendix A.

# Extending to the Bounded Case

Go channels do not have an unbounded variant. While the structure offered above is potentially useful, there are good reasons to prefer bounded channels in some settings[^bounded]. Moreover, unbuffered channels enable the synchronous programming style that is common in Go, in which a send/receive rendezvous synchronizes two cooperating threads; this level of synchronization is useful to have. This section describes the implementation of a bounded channel on top of the unbounded implementation above.

## Preliminaries

We re-use the `queue` and `segment` types, along with the `findCell` and `Grow` machinery. Almost all of the difference is in the new `Enqueue` and `Dequeue` operations. These are, however, significantly more complex. This complexity is the result of senders and receivers being given new responsibilities:

* Senders must decide if they should block and wait for more receivers to arrive.

* Receivers, if they succeed in popping an element off of the queue, must wake up any waiting senders that ought to wake up.

As before, this protocol is implemented in a manner that avoids blocking unless blocking is required by the channel semantics. This means the `Enqueue` and `Dequeue` methods must consider arbitrary interleavings of the unbounded channel protocol and the new blocking protocol. The `BoundedChan` has an additional integer field `bound` indicating the maximum number of senders permitted to return without having rendezvoused with a receiver. We also introduce an immutable global `sentinel` pointer used by receiving threads to signal that a sender should not block. A consequence of this design is that all places that previously required a CAS from `nil` to another value must now also attempt to CAS from `sentinel`.
We maintain the invariant that no value will transition from `sentinel` back to `nil`, so the `tryCas` function below guarantees that `seg.Data[segInd]` is neither `nil` nor `sentinel` when it returns (unless `e` is either of those). ## (Aside) Possible Histories of an Element in a Segment In the unbounded case, there were essentially two possible histories of a value in the queue: |Events | History | |----------------------|------------------------| |Sender, Receiver | `nil` $\to$ `Elt` | |Receiver, Sender | `nil` $\to$ `*waiter` | This can be viewed as the key invariant that is enforced in the implementation of unbounded channels. There are more histories in the bounded case. These (and only these) can all arise --- keeping this in mind is helpful for understanding the protocol: --------------------------------------------------------------------------------------------- Events History --------------------------------------------------- ---------------------------------------- Sender, Receiver `nil` $\to$ `Elt` Receiver, Sender `nil` $\to$ `*waiter` Waker, Sender, Receiver `nil` $\to$ `sentinel` $\to$ `Elt` Waker, Receiver, Sender `nil` $\to$ `sentinel` $\to$ `*waiter` $\textrm{Sender}^\dagger$, Waker, Sender, Receiver `nil` $\to$ `*weakWaiter` $\to$ `Elt` $\textrm{Sender}^\dagger$, Waker, Receiver, Sender `nil` $\to$ `*weakWaiter` $\to$ `*waiter` --------------------------------------------------- ---------------------------------------- Where $\textrm{Sender}^\dagger$ denotes that a sender arrives but must block for more receivers to complete, and a Waker is any thread that successfully wakes up a blocked Sender. The details of what a `weakWaiter` is and who exactly plays the role of "Waker" are covered in the following sections. 
## Enqueue

We first present the source of `tryCas` and `Enqueue`:

~~~~ {.go .numberLines}
func tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {
    return atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
        unsafe.Pointer(nil), elt) ||
        atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
            sentinel, elt)
}

// Enqueue sends e on b. If there are already >=bound goroutines blocking, then
// Enqueue will block until sufficiently many elements have been received.
func (b *BoundedChan) Enqueue(e Elt) {
    b.adjust()
    startHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))
    myInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := b.q.findCell(b.tail, cell)
    if myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {
        // there is a chance that we have to block
        var w interface{} = makeWeakWaiter(2)
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
            unsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {
            // we successfully swapped in w. No one will overwrite this
            // location unless they send on w first. We block.
            w.(*weakWaiter).Wait()
            if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
                unsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {
                return
            }
            // someone put a waiter into this location. We need to use the slow path
        } else if atomic.CompareAndSwapPointer(
            (*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
            sentinel, unsafe.Pointer(e)) {
            // Between us reading startHead and now, there were enough
            // increments to make it the case that we should no longer
            // block.
            return
        }
    } else {
        // normal case. We know we don't have to block because b.q.H can only
        // increase.
        if tryCas(seg, cellInd, unsafe.Pointer(e)) {
            return
        }
    }
    ptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))
    w := (*waiter)(ptr)
    w.Send(e)
    return
}
~~~~

`Enqueue` starts by loading a value of `H` and then acquiring `myInd`. Note that this *is not* a consistent snapshot of the state of the queue, as `H` could have moved between loading it and acquiring `myInd` (lines 12--13). However, `H` will only increase! If `startHead` is within `b.bound` of `myInd`, it means that `H` is at most that far behind where `T` was when we performed the increment. In that case we can simply attempt to CAS in `e` (line 40). If that fails, it can only mean that a receiver has placed a `waiter` at this index, so we wake up the receiver and return (lines 44--46).

If there is a chance that we *do* have to block, then we allocate a new `weakWaiter`. A `weakWaiter` is like a `waiter` except that it does not carry a value, and it tolerates more than one signal being sent on it. There are many ways to implement such a construct in Go; here is an implementation in terms of a `WaitGroup`:

```go
type weakWaiter struct {
    OSize, Size int32
    Wgroup      sync.WaitGroup
}

func makeWeakWaiter(i int32) *weakWaiter {
    wait := &weakWaiter{Size: i, OSize: i}
    wait.Wgroup.Add(1)
    return wait
}

func (w *weakWaiter) Signal() {
    newVal := atomic.AddInt32(&w.Size, -1)
    orig := atomic.LoadInt32(&w.OSize)
    if newVal+1 == orig {
        // only the first Signal releases the Wait below
        w.Wgroup.Done()
    }
}

func (w *weakWaiter) Wait() {
    w.Wgroup.Wait()
}
```

In the case that we may have to block, we construct a `weakWaiter` with a buffer size of two, because it is possible for two dequeueing threads to concurrently attempt to wake up a single enqueueing thread (see below). If the sender successfully CASes `w` into the proper location (line 19), then it waits, and attempts the rest of the unbounded channel protocol when it wakes. There are two possible scenarios if this CAS fails:

(1) A receiver for `b.bound` elements forward in the channel attempted to wake up this sender, but arrived before `w` was stored.
(2) A receiver has already started waiting at this location.

The CAS on line 29 determines which case this is. If (2), the CAS will fail and the sender must now wake up the waiting receiver thread on line 46. If (1), the CAS will succeed and `e` will successfully be in the queue.

## Dequeue

The `Dequeue` implementation effectively mirrors the `Enqueue` implementation. There are, however, a few things that are especially subtle. Let's start with the implementation:

~~~~ {.go .numberLines startFrom="49"}
func (b *BoundedChan) Dequeue() Elt {
    b.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)
    cell, segInd := myInd.SplitInd()
    seg := b.q.findCell(b.head, cell)
    // If there are Enqueuers waiting to complete due to the buffer size, we
    // take responsibility for waking up the thread that FA'ed b.q.H + b.bound.
    // If bound is zero, that is just the current thread. Otherwise we have to
    // do some extra work. The thread we are waking up is referred to in names
    // and comments as our 'buddy'.
    var (
        bCell, bInd index
        bSeg        *segment
    )
    if b.bound > 0 {
        buddy := myInd + index(b.bound)
        bCell, bInd = buddy.SplitInd()
        bSeg = b.q.findCell(b.head, bCell)
    }
    w := makeWaiter()
    var res Elt
    if tryCas(seg, segInd, unsafe.Pointer(w)) {
        res = w.Recv()
    } else {
        // tryCas failed, which means that through the "possible histories"
        // argument, this must be either an Elt, a waiter or a weakWaiter. It
        // cannot be a waiter because we are the only actor allowed to swap
        // one into this location. Thus it must either be a weakWaiter or an Elt.
        // If it is a weakWaiter, then we must signal it before casing in w,
        // otherwise the other thread could starve. If it is a normal Elt we
        // do the rest of the protocol. This also means that we can safely load
        // an Elt from seg, which is not always the case because sentinel is
        // not an Elt.
        // Step 1: We failed to put our waiter into segInd. That means that either
        // our value is in there, or there is a weakWaiter in there.
        // Either way these are valid elts and we can reliably distinguish them with a type assertion
        elt := seg.Load(segInd)
        res = elt
        if ww, ok := (*elt).(*weakWaiter); ok {
            ww.Signal()
            if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
                unsafe.Pointer(elt), unsafe.Pointer(w)) {
                res = w.Recv()
            } else {
                // someone cas'ed a value over the weakWaiter; that could only
                // have been our friend on the enqueue side
                res = seg.Load(segInd)
            }
        }
    }
    for b.bound > 0 { // runs at most twice
        // We have successfully gotten the value out of our cell. Now we
        // must ensure that our buddy is either woken up if they are
        // waiting, or that they will know not to sleep.
        // If bElt is not nil, it holds either an Elt or a weakWaiter. If
        // it holds a weakWaiter then we need to signal it to wake up the
        // buddy. If bElt is nil then we attempt to cas sentinel into the
        // buddy index. If we fail, the buddy may have cas'ed in a
        // weakWaiter, so we must go again. However, that will only happen
        // once.
        bElt := bSeg.Load(bInd)
        if bElt != nil {
            if ww, ok := (*bElt).(*weakWaiter); ok {
                ww.Signal()
            }
            // there is a real queue value in bSeg.Data[bInd], therefore
            // buddy cannot be waiting.
            break
        }
        // Let buddy know that they do not have to block
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])),
            unsafe.Pointer(nil), sentinel) {
            break
        }
    }
    return res
}
~~~~

Now the subtleties. A dequeuer may have to wake up multiple waiting send threads: the one waiting at `myInd` and the one waiting at `myInd + bound` (i.e. the slot at `bInd`). This may seem strange, because the dequeuer that receives `myInd - bound` ought to have woken up any pending sender at `myInd`. The issue is that *we have no guarantee that this dequeuer has returned*. The possibility of this occurring is remote with a large buffer size, but when `bound` is small it happens with some regularity.

The second subtlety is a peculiarity of Go.
On line 87 there is a *type assertion*, which dereferences an `Elt` to yield a value of type `interface{}`. The `interface{}` contains a pointer to some runtime information about the actual type of the pointed-to struct, and the `.(*weakWaiter)` syntax queries whether `elt` is a pointer to a `weakWaiter`. This is a safe thing to do because `weakWaiter` is a package-private type: no external caller could pass in an `Elt` that pointed to a `weakWaiter` unless we returned one from one of the public functions in the package, which we do not. This is complicated by the fact that `*waiter`s are actually stored in the queue directly, without hiding behind an interface value (e.g. at line 90). This is because the extra layer of indirection is unnecessary: it is always possible to determine whether an `Elt` or a `*waiter` is present in a given location based on which CASes have failed and which have succeeded.

# Performance

We benchmarked six separate channel configurations on enqueue/dequeue pairs:

* *Bounded0*: A `BoundedChan` with buffer size 0

* *Bounded1K*: A `BoundedChan` with buffer size 1024

* *Unbounded*: An `UnboundedChan`

* *Chan0*: An unbuffered native Go channel

* *Chan1K*: A native Go channel with buffer size 1024

* *Chan10M*: A native Go channel with buffer size $10^7$, which is the total number of elements enqueued into the channel over the course of the benchmark.

We include benchmark results for two cases: one where we allocate one goroutine per processor (where the processor count is set with the `GOMAXPROCS` function from Go's runtime package), and one where we allocate 5000 goroutines, irrespective of the current value of `GOMAXPROCS`. We include both of these for two reasons. First, it is not uncommon to have thousands of goroutines active in a running Go program, so it makes sense to consider the case where processors are oversubscribed in that manner. Second, we noticed that performance is often *better* in the cases where cores are oversubscribed.
While counter-intuitive, this is possibly due to a combination of unpredictable scheduler performance and the lower overhead of synchronizing between two goroutines executing on the same core.

These benchmarks were conducted on a machine with 2 Intel Xeon 2620 v4 CPUs, each with 8 cores clocked at 2.1GHz and two hardware threads per core. We were unable to allocate cores in an intuitive manner, so the 16-core benchmark is actually using all of a single CPU's hardware threads; only at core counts higher than 16 does the program cross a NUMA domain. The benchmarks were run on the Windows Subsystem for Linux[^wsl], an implementation of an Ubuntu 14.04 userland within the Windows 10 operating system. These benchmarks were conducted using Go version 1.6. The numbers were produced by performing 5,000,000 enqueues and dequeues per configuration, averaged over 5 iterations per setting, with a full GC between iterations.

The benchmarks show that both *Bounded* and *Unbounded* are able to increase throughput as the core count increases; native Go channels are unable to do so. When using more than 4 processors, *Unbounded* and *Bounded1K* provide much better throughput than native channels regardless of buffer size. *Unbounded* in particular is often 2-3x faster than the buffered *Chan* configurations, while *Bounded0* continues to increase throughput even after crossing a NUMA domain and dipping into using multiple hardware threads per core. At the highest core counts, all three new configurations outpace native Go channels.

![](contend_graph.pdf)

![](gmp_graph.pdf)

# Linearizability

We contend that both the bounded and unbounded queues presented in this document are *linearizable* with respect to their `Enqueue` and `Dequeue` operations. Linearizability is a strong consistency guarantee often used to specify the behavior of concurrent data-structures.
Informally, we say a structure is linearizable if, for an arbitrary (possibly infinite) history of concurrent operations on the structure beginning and ending at specific times, we can *linearize* it such that each operation occurs atomically at some point in time between its beginning and its end [See Chapter 3 of @herlihyBook for an overview; linearizability was introduced in @herlihyLinear]. This section describes linearization procedures for the bounded and unbounded channels in this document.

Both channels begin with a fetch-add on the head or tail index of the queue that determines the *logical index* that will be the subject of their send or receive. We denote by $e_i$ and $d_i$ the enqueue and dequeue operations that fetch-add to get a value of `myInd` equal to $i$. We will provide linearizations that preserve the following properties, where $\prec$ indicates precedence in the linearized sequence of events. For all $i$ we must have that

(1) $e_i \prec e_{i+1}$ (if $e_{i+1}$ occurs)

(2) $d_i \prec d_{i+1}$ (if $d_{i+1}$ occurs)

(3) $e_i \prec d_i$ (if both occur)

We take this to be a straightforward sequential specification for a channel.

## Unbounded Channels

Our linearization procedure considers two broad cases, a fast and a slow path.

* In the fast path, there is sufficient distance between enqueuers and dequeuers that the fetch-add of $e_i$ occurs before the fetch-add of $d_i$ *and* $e_i$'s CAS succeeds. In this case, linearize $e_i$ and $d_i$ at their respective fetch-adds.

* In the case where $d_i$'s fetch-add occurs before that of $e_i$ (or the CAS fails), we linearize *both* operations at $e_i$'s fetch-add, with $e_i$ occurring just before $d_i$.

Observe that both cases in this procedure linearize $e_i,d_i$ between their starting and finishing times.
The second case is guaranteed to do so because if $d_i$ must block then $e_i$ is responsible for unblocking it, and if $d_i$ does not block then we know its CAS fails, meaning that $e_i$'s fetch-add occurs after $d_i$'s fetch-add but before its failed CAS.

We will now show that the above procedure yields a history consistent with the three criteria provided above. The proof strategy is to show, for both the fast and slow paths, that we can maintain the criteria for an arbitrary $e_i,d_i,e_{i+1},d_{i+1}$. Given this, we can conclude that the criteria are satisfied for an arbitrary number of enqueue-dequeue pairs. We then consider the other possible cases.

*The Fast Path* We know that we satisfy (1) because all $e_i$, fast or slow path, linearize at their fetch-adds, and these are guaranteed to provide a total ordering on operations. We satisfy (3) by assumption. Consider $d_{i+1}$: if it hits the fast path then it is linearized at its fetch-add, which must happen after $d_i$'s fetch-add. If it hits the slow path then it will be linearized at the fetch-add of $e_{i+1}$, but by assumption we only hit the slow path if $d_{i+1}$'s fetch-add completed before that of $e_{i+1}$; $d_{i+1}$'s fetch-add certainly completed after that of $d_i$, so we satisfy (2).

*The Slow Path* The argument for (1) is the same as in the fast path, and the argument for (3) follows by assumption. Once again, the interesting case is to show that we maintain an ordering between dequeue operations. There are two possible cases:

(1) *$d_{i+1}$ blocks.* We know that $d_{i+1}$ will take the slow path, and will therefore be linearized at a later fetch-add.

(2) *$d_{i+1}$ does not block.* The only way that $d_{i+1}$ does not block is if its CAS fails, which means that another enqueuer $e_{i+1}$ has completed its fetch-add. Regardless of whether $d_{i+1}$ is linearized on a slow path or a fast path, it must be linearized after the fetch-add of $e_{i+1}$, and hence also after that of $e_i$.
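As an aside, the three criteria are mechanical enough to check on concrete histories. The toy checker below is purely illustrative (it is not part of the implementation); it assumes a linearized history given as a sequence of `("e", i)` and `("d", i)` events, with indices handed out contiguously by the fetch-adds:

```go
type op struct {
    kind string // "e" for enqueue, "d" for dequeue
    i    int    // the value of myInd acquired by the operation's fetch-add
}

// check reports whether a linearized history satisfies the three criteria:
// (1) e_i precedes e_{i+1}, (2) d_i precedes d_{i+1}, and (3) e_i precedes
// d_i whenever both occur.
func check(hist []op) bool {
    ePos, dPos := map[int]int{}, map[int]int{}
    prevE, prevD := -1, -1
    for t, o := range hist {
        if o.kind == "e" {
            if o.i != prevE+1 { // (1): enqueues must appear in index order
                return false
            }
            prevE = o.i
            ePos[o.i] = t
        } else {
            if o.i != prevD+1 { // (2): dequeues must appear in index order
                return false
            }
            prevD = o.i
            dPos[o.i] = t
        }
    }
    for i, dt := range dPos { // (3): e_i must precede d_i
        et, ok := ePos[i]
        if !ok || et > dt {
            return false
        }
    }
    return true
}
```

Both linearization cases above produce histories this predicate accepts: the fast path yields $e_0, d_0, e_1, d_1, \ldots$, and the slow path places each $e_i$ immediately before its $d_i$.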
*Small Numbers of Operations* If there is only one enqueue operation, then at most one dequeue operation will be linearized. This is fine, because at most one dequeue operation will complete, while any others will block forever. The definitions of the two cases in the linearization procedure automatically yield condition (3), while (1) and (2) are trivially satisfied, as there is only one enqueue and at most one dequeue.

*Concluding* We can conclude by induction that for any finite number of enqueues and dequeues, there is a linearization that satisfies a standard sequential specification for a channel. For infinite sequences of operations (assuming `H` and `T` can be updated with arbitrary precision) there is likely a similar co-inductive characterization of the same process; the above argument should still hold. We conclude that unbounded channels are linearizable. $\square$

## Bounded Channels

The bounded case has the same linearization procedure (and proofs) as the unbounded case, with the caveat that enqueue operations that do not return never make it into the history. This works because all operations unconditionally perform fetch-adds, even if they later have to block for an unbounded amount of time. $\square$

# Conclusion and Future Work

This document demonstrates that it is possible to have scalable unbounded and bounded queues while still satisfying a strong consistency guarantee. It leverages techniques from the recent literature on non-blocking queues to implement (to our knowledge) novel blocking constructs. There are a number of avenues for future work.

**Verification** It will be useful to model both channels in [SPIN](http://spinroot.com/spin/whatispin.html) or [TLA+](http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html) to provide further assurance that the algorithms are correct.
While it would be more involved, proving correctness in [Coq](https://coq.inria.fr/) in line with techniques mentioned in [FRAP](http://adam.chlipala.net/frap/) would also be helpful in building confidence in the algorithms.

**Implement in the Go runtime** Implementing these channels within the runtime could further reduce these algorithms' overhead. In particular, it would allow for a more efficient implementation of the blocking semantics, as the runtime can access goroutine and scheduling metadata directly, whereas the current implementation relies on `WaitGroup`s, which may be too heavyweight for our purposes.

**Improving Performance** Some variants of this algorithm still perform worse at lower core counts than their native Go equivalents. One possible reason for this is how much allocation these queues perform (Go channels need only keep a single fixed-size buffer). It could be fruitful to experiment with schemes that reduce allocation, as well as algorithms that allocate a fixed-size buffer, similar to the CRQ algorithm in @lcrq.

# Appendix A: Efficient Segment Allocation

In order to speed up allocation, we add a list to the queue state. This list is similar to standard lock-free queue designs in the literature, and bears some resemblance to `Queue1` above. The major difference here is that we only provide partial push and pop operations: Push will fail if the list may be too large or if it runs out of `patience`, and Pop will fail if its CAS fails more than `patience` times.

~~~~{.go}
type listElt *segment

type segList struct {
    MaxSpares, Length int64
    Head              *segLink
}

type segLink struct {
    Elt  listElt
    Next *segLink
}

func (s *segList) TryPush(e listElt) {
    // bail out if list is at capacity
    if atomic.LoadInt64(&s.Length) >= s.MaxSpares {
        return
    }
    // Add to Length. Note that this is not atomic with respect to the append,
    // which means we may be under capacity on occasion. This list is only used
    // in a best-effort capacity, so that is okay.
    atomic.AddInt64(&s.Length, 1)
    tl := &segLink{Elt: e, Next: nil}
    const patience = 4
    i := 0
    for ; i < patience; i++ {
        // attempt to cas Head from nil to tl
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),
            unsafe.Pointer(nil), unsafe.Pointer(tl)) {
            break
        }
        // otherwise, walk to the end of the list
        tailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))
        if tailPtr == nil {
            // if Head was switched back to nil, retry
            continue
        }
        // advance tailPtr until it has a nil Next pointer
        for {
            next := (*segLink)(atomic.LoadPointer(
                (*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next))))
            if next == nil {
                break
            }
            tailPtr = next
        }
        // try to add something to the end of the list
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)),
            unsafe.Pointer(nil), unsafe.Pointer(tl)) {
            break
        }
    }
    if i == patience {
        atomic.AddInt64(&s.Length, -1)
    }
}

func (s *segList) TryPop() (e listElt, ok bool) {
    const patience = 1
    if atomic.LoadInt64(&s.Length) <= 0 {
        return nil, false
    }
    for i := 0; i < patience; i++ {
        hd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))
        if hd == nil {
            return nil, false
        }
        // if head is not nil, try to swap it for its Next pointer
        nxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next))))
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),
            unsafe.Pointer(hd), unsafe.Pointer(nxt)) {
            atomic.AddInt64(&s.Length, -1)
            return hd.Elt, true
        }
    }
    return nil, false
}
~~~~

Given this list implementation, we simply insert calls to `TryPush` and `TryPop` around the original implementation of `Grow` to have it take advantage of extra allocations:

~~~~ {.go}
type queue struct {
    H, T        index
    SpareAllocs segList
}

func (q *queue) Grow(tail *segment) {
    curTail := atomic.LoadUint64((*uint64)(&tail.ID))
    if next, ok := q.SpareAllocs.TryPop(); ok {
        atomic.StoreUint64((*uint64)(&next.ID), curTail+1)
        if
 atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
            unsafe.Pointer(nil), unsafe.Pointer(next)) {
            return
        }
    }
    newSegment := &segment{ID: index(curTail + 1)}
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
        unsafe.Pointer(nil), unsafe.Pointer(newSegment)) {
        return
    }
    // If we allocated a new segment but failed, attempt to place it in
    // SpareAllocs so someone else can use it.
    q.SpareAllocs.TryPush(newSegment)
}
~~~~

This scheme led to significant speedups in performance tests, but the code in `q.go` includes a constant that, if set to false, will disable any such list-based caching of allocations. This should make it easy to verify or falsify those performance measurements.

# References

[^pseudo]: Pseudocode in this document will increasingly resemble real, working Go code. While we will try to explain core Go concepts as we go, a passing familiarity with Go syntax (or at least a willingness to squint and pretend one is reading C) will be helpful.

[^select]: Our focus is send and receive; we do not cover `select` or `close` here. `Close` would be fairly simple to add; `select` could be implemented by using channels for the waiting mechanism used by receivers. While this would not be difficult, it would slow things down compared to the `WaitGroup` implementation.

[^goroutine]: Go's standard unit of concurrency is called a goroutine. Goroutines take the place of threads in a language like C, but they are generally much cheaper to create and provide faster context switches. Many goroutines are independently scheduled on top of a smaller number of native operating system threads. This scheduling is not preemptive in the standard implementation; rather, goroutines implicitly yield on function-call boundaries.

[^chanimp]: See the [Go channel source](https://golang.org/src/runtime/chan.go). In particular note calls to `lock` in `chansend` and `chanrecv`.
[^combine]: Consult the related work sections of @wfq on *combining* queues for an example of this; @lcrq has a similar survey.

[^msqueue]: Less contention is not something that you get automatically when an algorithm is lock-free. An early lock-free queue @MSQueue still suffers from bottlenecks around the head and tail pointers all being CAS-ed by contending threads. Most of these CASes will fail, and all threads whose CASes fail must retry. Exponential backoff schemes can help this state of affairs, but the bottleneck is still present; see the performance measurements in @wfq, which include the algorithm from @MSQueue.

[^fasem]: F&A is more commonly defined to return the *old* value of `src`, but returning the new value is equivalent.

[^helping]: Helping is a standard technique for making obstruction-free or lock-free algorithms wait-free. The technique goes back to @wfsync; the practice of using a weaker progress guarantee as a fast path and then falling back to a helping mechanism to ensure wait freedom was introduced in @FastSlow. An explanation of helping can be found in @herlihyBook, chapters 6 and 10.5.

[^bounded]: See, for example, [this discussion](https://mail.mozilla.org/pipermail/rust-dev/2013-December/007449.html) on the Rust mailing list regarding unbounded channels. Haskell's standard channel implementation in [Control.Concurrent](https://hackage.haskell.org/package/base-4.9.0.0/docs/Control-Concurrent-Chan.html) is unbounded, as are the STM variants.

[^wsl]: See [this blog post](https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/) as well as the various [follow-ups](https://blogs.msdn.microsoft.com/wsl/) for an overview of this system.

[^livelockdef]: A [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock) is a scenario in which one or more threads never block (i.e. they continuously change their respective states) but still indefinitely fail to make progress.