[
  {
    "path": "DistributedProgrammingRPC/README.md",
    "content": "# DistributedProgrammingTalk\nReferences and Credits for A Brief History of Distributed Programing: RPC.  To be given at CodeMesh 2016 [[Slides](https://speakerdeck.com/caitiem20/a-brief-history-of-distributed-programming-rpc)] [[Video](https://www.youtube.com/watch?v=aDWZyYHj2XM)]\n\n#Abstract\nWhile many of the distributed systems we operate today are built with language like Java and Go, distributed programming has a long history of innovation and adoption of its ideas. This include innovations seen all throughout the various fields of computing: novel type systems for dynamic languages; the concept of the promise, now a standard programming technique in web development;  and unified models of programming when data lives across nodes. Some of these ideas had major impact, while some fell incredibly short. Many technically superior ideas were not adopted simply because they were too “research” focused.\nDuring this talk, we will present the history of RPC and why RPC may not be the best abstraction for building your next distributed application.\n\n#Resources\n* [Node.js](https://nodejs.org/en/about/)\n* [The Go Programming Language](https://golang.org/)\n* [Finagle](https://twitter.github.io/finagle/)\n* [gRPC](http://www.grpc.io/)\n* [Blog: Remote Procedure Call](https://christophermeiklejohn.com/pl/2016/04/12/rpc.html) by Christopher Meiklejohn\n* [RFC 674](https://tools.ietf.org/html/rfc674)\n* [RFC 684](https://tools.ietf.org/html/rfc684)\n* [RFC 707](https://tools.ietf.org/html/rfc707)\n* [Implementing Remote Procedure Calls](http://www.cs.virginia.edu/~zaher/classes/CS656/birrel.pdf)\n* [A Critique of the Remote Procedure Call](http://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pdf)\n* [RFC 1094 NFS](https://tools.ietf.org/html/rfc1094)\n* [A Note on Distributed Computation](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.7628&rep=rep1&type=pdf)\n* Spores \n  * [Strange Loop 
Talk](https://www.youtube.com/watch?v=coX9RKH4rOs) by Heather Miller\n  * [Spores: A Type-Based Foundation for Closures in the Age of Concurrency and Distribution](https://infoscience.epfl.ch/record/191239/files/spores_1.pdf)\n"
  },
  {
    "path": "DistributedProgrammingRPC/credits.md",
    "content": "#Image Credits\n* https://thenounproject.com/search/?q=laptop+user&i=512528\n* https://thenounproject.com/term/user/512525/\n* https://thenounproject.com/term/browser-cloud/523468/\n* https://thenounproject.com/search/?q=database&i=9658\n* https://thenounproject.com/search/?q=phone&i=565365\n"
  },
  {
    "path": "Halo4/Readme.md",
    "content": "## Building the Halo 4 Services with Orleans\nGiven at QCon London 2015 [[Video](https://www.infoq.com/presentations/halo-4-orleans)] [[Slides](https://speakerdeck.com/caitiem20/qcon-london-2015-building-the-halo-4-services-with-orleans)]\n\n### Summary\nCaitie McCaffrey does an overview of Orleans, the challenges faced when building the Halo 4 services, and why the Actor Model and Orleans in particular were utilized to solve these problems.\n\n## Architecting and Launching the Halo 4 Services\nGiven as a the evening Keynote at SRECon 2015 [[Video](https://www.usenix.org/conference/srecon15/program/presentation/mccaffrey)] [[Slides](https://speakerdeck.com/caitiem20/architecting-and-launching-the-halo-4-services-sre-con-15)]\n\n### Abstract\nHalo 4 is a first-person shooter on the Xbox 360, with fast-paced, competitive gameplay. To complement the code on disc, a set of services were developed and deployed in Azure to store player statistics, display player presence information, deliver daily challenges, modify playlists, catch cheaters, and more.  As of June 2013, Halo 4 had 11.6 million players who played 1.5 billion games, logging 270 million hours of gameplay.\n\nThe Halo 4 services were built from the ground up to support high demand, low latency, and high availability.  In addition, video games have unique load patterns where the majority of the traffic and sales occurs within the first few weeks after launch, making this a critical time period for the game and supporting services. Halo 4 went from 0 to 1 million users on day 1, and 4 million users within the first week.\n\nThis talk will discuss the architectural challenges faced when building these services and how they were solved using Windows Azure and Project Orleans. In addition, we'll discuss the path to production, some of the difficulties faced, and the tooling and practices that made the launch successful.\n\n"
  },
  {
    "path": "PapersWeLove/README.md",
    "content": "The accompanying repository for all of the Papers We Love Talks I've given\n\n## Orleans: Virtual Actors For Programability & Scalability.\nGiven at Papers We Love DC: September 2014 & Papers We Love SF: February 2015. [[Video](https://www.youtube.com/watch?v=gY8zKZUazvo)] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-sf-orleans-distributed-virtual-actors-for-programmability-and-scalability)] [[Paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Orleans-MSR-TR-2014-41.pdf)]\n\n## Sagas\nGiven at Papers We Love SF: April 2016 [[Video](https://youtu.be/7dc4Tl5ZHRg?list=PLGRqfvsPiRSih6qb8PRAQYQV9dq9pMgNX)] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-sf-sagas)] [[Paper](https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf)]\n\n## Simple Testing Can Prevent Most Critical Failures\nGiven at Papers We Love NY: June 2016 [[Video](https://www.youtube.com/watch?v=-3tw2MYYT0Q&feature=youtu.be&t=1h6m17s)] [[Slides](https://speakerdeck.com/caitiem20/pwl-ny-simple-testing-can-prevent-most-critical-failures)] [[Paper](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf)]\n\n## Detection of Mutual Inconsistency in Distributed Systems\nGiven at Papers We Love PDX: June 2016 [No Video] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-pdx)] [[Paper](http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf)]\n\n## Distributed Programming in Argus\nGiven at Papers We Love SF: January 2017 & Papers We Love SEA November 2017 [[Slides](https://speakerdeck.com/caitiem20/argus-papers-we-love)] [[Paper](https://pdos.csail.mit.edu/6.824/papers/argus88.pdf)]\n\n\n\n"
  },
  {
    "path": "README.md",
    "content": "# Talks\nThis repository contains resources, references & credits for talks I've given.  \n"
  },
  {
    "path": "Sagas/README.md",
    "content": "##Applying the Saga Pattern\nThis talk was given at Craft Conf 2015 & Goto Chicago 2015.  [[Video](https://www.youtube.com/watch?v=xDuwrtwYHu8&index=46&list=PLEx5khR4g7PKFs3Y-gWd8TX4Y_5yTyUTP)] [[Slides](https://speakerdeck.com/caitiem20/applying-the-saga-pattern)]\n\n###Abstract\nAs we build larger more complex applications and solutions that need to do collaborative processing the traditional ACID transaction model using coordinated 2-phase commit is often no longer suitable. More frequently we have long lived transactions or must act upon resources distributed across various locations and trust boundaries. The Saga Pattern is a useful model for long lived activities and distributed transactions without coordination.\n\nSagas split work into a set of transactions whose effects can be reversed even after the work has been performed or committed. If a failure occurs compensating transactions are performed to rollback the work. So at its core the Saga is a failure Management Pattern, making it particularly applicable to distributed systems.\n\nIn this talk, I'll discuss the fundamentals of the Saga Pattern, and how it can be applied to your systems. In addition we'll discuss how the Halo 4 Services successfully made use of the Saga Pattern when processing game statistics, and how we implemented it in production.\n"
  },
  {
    "path": "ScalingStatefulServices/readme.md",
    "content": "# Scaling Stateful Services\nThe accompanying repository for The ScalingStatefulServices talk\n* V1 StrangeLoop 2015: [Video](https://www.youtube.com/watch?v=H0i_bXKwujQ), [Slides](https://speakerdeck.com/caitiem20/building-scalable-stateful-services), [High Scalability Article](http://highscalability.com/blog/2015/10/12/making-the-case-for-building-scalable-stateful-services-in-t.html), [InfoQ Article](http://www.infoq.com/news/2015/11/scaling-stateful-services)\n* V2 Craft Conf 2016: [Slides](https://speakerdeck.com/caitiem20/craftconf-2016-building-scalable-stateful-services#)\n* V4 Curry On 2016 [Video](https://www.youtube.com/watch?v=aJFxQAAMAQc)\n* V3 Nike Tech Talk 2016: [Slides](https://speakerdeck.com/caitiem20/building-scalable-stateful-services-1)\n\n## Abstract\nThe Stateless Service design principle has become ubiquitous in the tech industry \nfor creating horizontally scalable services.  However our applications do have state, \nwe just have moved all of it to caches and databases.  Today as applications are \nbecoming more data intensive and request latencies are expected to be incredibly \nlow, we’d like the benefits of stateful services, like data locality and sticky \nconsistency.  
In this talk I will address the benefits of stateful services,\nhow to build them so that they scale, and discuss distributed and scalable\nservices in the real world that implement these techniques successfully.\n\n## Resources\n* Consistency Models\n  * [Strong Consistency Models](https://aphyr.com/posts/313-strong-consistency-models) by Kyle Kingsbury\n  * [Types of Consistency](http://www.cs.colostate.edu/~cs551/CourseNotes/Consistency/TypesConsistency.html)\n  * [Consistency Diagram](http://www.vldb.org/pvldb/vol7/p181-bailis.pdf) by Peter Bailis et al\n  * [Eventual Consistency Revisited](http://www.allthingsdistributed.com/2008/12/eventually_consistent.html) by Werner Vogels\n* Cluster Membership: Gossip Protocols \n  * [Gossip Protocol](https://en.wikipedia.org/wiki/Gossip_protocol)\n  * [Epidemic Algorithms for Replicated Database Maintenance](https://pdfs.semanticscholar.org/49ed/15db181c74c7067ec01800fb5392411c868c.pdf)\n  * [Membership, Dissemination and Population Protocols](https://qconnewyork.com/ny2016/ny2016/presentation/membership-dissemination-and-population-protocols.html) Video to Come\n* Work Distribution: Consistent Hashing & Distributed Hash Tables\n  * [Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web](https://www.akamai.com/es/es/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf)\n  * [Dynamo: Amazon's Highly Available Key Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) uses Consistent Hashing\n  * [Distributed Hash Table](https://en.wikipedia.org/wiki/Distributed_hash_table)\n* Real World Systems\n  * [Scuba: Diving into Data at Facebook](https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/)\n  * [Twitter Nuthatch 
Service](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i) from Observability at Twitter: Technical Overview, part 1\n  * Uber Ringpop\n    * [Uber Ringpop](http://uber.github.io/ringpop/)\n    * [Uber's Ringpop and the Fight for Flap Dampening](http://www.infoq.com/presentations/halo-4-orleans)\n    * [SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)\n  * Orleans\n    * [Microsoft Orleans](http://dotnet.github.io/orleans/)\n    * [Orleans: Distributed Virtual Actors for Programmability and Scalability](http://research.microsoft.com/apps/pubs/default.aspx?id=210931)\n    * [Building the Halo 4 Services with Orleans](http://www.infoq.com/presentations/halo-4-orleans)\n    * [A Universal Modular ACTOR Formalism for Artificial Intelligence](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.77.7898)\n* Challenges & Lessons Learned\n  * [Everything will Flow: Distributed Queues & Backpressure](https://www.youtube.com/watch?v=1bNOO3xxMc0&app=desktop)\n  * [Fast Database Restarts at Facebook](https://research.facebook.com/publications/fast-database-restarts-at-facebook/)\n* Intro to Distributed Systems\n  * [What we Talk about when we talk about Distributed Systems](http://videlalvaro.github.io/2015/12/learning-about-distributed-systems.html)\n\n## Bio\nCaitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter.  Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO.  Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5.  She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie\n"
  },
  {
    "path": "SoWeHearYouLikePapers/README.md",
    "content": "So We Hear You Like Papers is a series of talks that bridge the gap between the academia and industry, given by myself and [Ines Sombra](https://github.com/Randommood).  Each version of the talk expores different academic concepts in distributed systems and how they apply to industry. \n\n## So We Hear You Like Papers\nGiven as the evening Keynote at QconSF 2015.  [[Slides](https://speakerdeck.com/randommood/we-hear-you-like-papers-qcon-edition)] [[Video](https://www.infoq.com/presentations/papers-large-distributed-systems)] [[Resources](https://github.com/Randommood/QConSF2015)]\n\n## So We Hear You Like Papers Too\nGiven as a Keynote at Velocity Conf Santa Clara 2016 [[Slides](https://speakerdeck.com/randommood/we-hear-you-like-papers-velocity-edition)] [[Video](https://www.oreilly.com/ideas/so-we-hear-you-like-papers)] [[Resources](https://github.com/Randommood/Velocity2016)]\n\n## So We Hear You Like Papers: Eventual Consistency\nGiven at Women Who Code Sydney Meetup December 2016 [[Slides](https://speakerdeck.com/caitiem20/we-hear-you-like-papers-eventual-consistency)]\n\n### Resources\n* [Detection of Mutual Inconsistency in Distributed Systems](http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf)\n* [Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://www.cs.berkeley.edu/~brewer/cs262b/update-conflicts.pdf)\n* [Peter Bailis - Papers We Love: Managing Update Conflicts in Bayou, A Weakly Connected Replicated Storage System](https://www.youtube.com/watch?v=txP7CI0PjO4) Talk\n* [Brewer's conjecture & the feasibility of consistent, available, partition-tolerant web](http://perso.telecom-paristech.fr/~kuznetso/INF346-2015/papers/cap.pdf)\n* [CAP Twelve Years Later: How the \"Rules\" Have Changed](http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed)\n* [Conflict-free Replicated Data Types](https://hal.inria.fr/inria-00609399v1/document) Talk from Codemesh IO 2016\n* [A 
Conflict-Free Replicated JSON Datatype](https://martin.kleppmann.com/2016/08/13/json-crdt.html) Martin Kleppmann et al.\n* [A Comprehensive study of Convergent and Commutative Replicated Data Types](https://hal.inria.fr/inria-00555588)\n* [Readings in Conflict Free Replicated Data Types](https://christophermeiklejohn.com/crdt/2014/07/22/readings-in-crdts.html)\n* [Conflict Resolution for Eventual Consistency](https://www.youtube.com/watch?v=8_DfwEpHE88)\n* [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf)\n* [The Morning Paper: Feral Concurrency Control](http://blog.acolyer.org/2015/09/04/feral-concurrency-control-an-empirical-investigation-of-modern-application-integrity/)\n"
  },
  {
    "path": "TacklingAlertFatigue/README.md",
    "content": "# Tackling Alert Fatigue\nAccompanying Repository for the \"Recovering From Alert Fagitue\" talk given at [Monitorama 2016](http://monitorama.com/) [[Slides](https://speakerdeck.com/caitiem20/tackling-alert-fatigue)]\n\n##Abstract\nSystems that generate numerous critical alerts result in alert fatigue which can result in service outages and developer burnout.  My team at Twitter found themselves in this situation.  The services had scaled by an order of magnitude in two years and were generating hundreds of alerts per quarter. Over the course of a quarter I led an initiative to decrease the number of alerts, improve the experience of being on call, and increase the reliability of the services.  These efforts were incredibly successful reducing the number of critical alerts by 50%.  In this talk I’ll discuss the process and alerting best practices we’ve put in places to successfully combat alert fatigue and avoid over alerting in the future.\n\n##References\n* [Novel Approach to Cardiac Alarm Management on Telemetry Units](http://www.nursingcenter.com/pdfjournal?AID=2545317&an=00005082-201409000-00016&Journal_ID=54006&Issue_ID=2544216)\n* [How one Hospital Tweaks its EHR to fight alert fatigue](http://www.healthcareitnews.com/news/how-one-hospital-tweaks-its-ehr-fight-alert-fatigue)\n* [Applying Cardiac Alarm Management to your Oncall](http://fractio.nl/2014/08/26/cardiac-alarms-and-ops/)\n* [Checklist Manifesto](http://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000)\n* [Engineering for the Long Game](https://www.infoq.com/presentations/continuous-innovation-systems-organizations/?utm_source=lanyrd&utm_medium=coverage&utm_campaign=lanyrdsfvideos)\n* [WTF is OPerations? 
#serverless](https://charity.wtf/2016/05/31/wtf-is-operations-serverless/)\n* [Devops for Developers Building an Effective Ops Org](http://www.ustream.tv/recorded/86181845)\n\n## Observability at Twitter\n* [Technical Overview Part 1](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i)\n* [Technical Overview Part 2](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-ii)\n* [Of the Order of Billions: Building Observability at Twitter](https://www.youtube.com/watch?v=SC6XuD1tgcQ)\n\n## Related Tweets\n* (https://twitter.com/mrtazz/status/626107423443410944)\n\n\n## Bio\nCaitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team.  Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO.  Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5.  She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie\n\n\n"
  },
  {
    "path": "TacklingAlertFatigue/credits.md",
    "content": "##Image Credits\n* https://www.flickr.com/photos/nokiae51/13950633381/\n* https://www.flickr.com/photos/richardsummers/523686414/\n* https://www.flickr.com/photos/ppapadimitriou/9201073336/\n* https://www.flickr.com/photos/kimeriksson/621870129/\n* https://www.flickr.com/photos/debsilver/102508862/\n* https://www.flickr.com/photos/yourcastlesdecor/14323386992/in/photolist-nPH6W9-7NYUFJ-bxTpuu-6Rjhwc-3uXmm-f17jkv-4T4Ds5-dzPRPZ-6j1PsU-7KvvDn-7NYULd-91k8BQ-phmhM4-7TR654-9LQDLV-5XFLND-oBDwNn-4Xb2L3-7HpUGE-7A4NKR-4m6HHq-3gt8jA-21dbc-pyNHEs-4mM8VA-neEQ8H-6NCaW3-a4gnBr-7NYUQY-dPoCq6-aiR1t8-4Shmo-666Dfo-aebp8e-6nDkE-6JLSsd-dtbpHT-ReyB5-cv9iV5-2sHHNm-bnmSkE-WXfjM-bqwpZN-m3GFx-jmiL6a-n1w19-eAsGk4-izhEW-8XMq8Z-666Cib\n* https://www.flickr.com/photos/glenscott/2593434622/\n* https://www.flickr.com/photos/68877365@N07/9500125786/in/photolist-ftuCqb-9jezut-fwi5YG-cagS-6WvUYp-dRKkmS-4Hs3ot-24whQ-a6yntM-6Kti3J-fsnucL-vQhq-2gmnHY-37jAYS-ARLam-49drf1-58qnqs-ke7AtH-4WXJhG-G3uet-68kmaN-6WSRye-8aqiU-9sjC3v-5mYbEA-5ZH94L-5kNqGH-9YPeeT-2DWhMw-2hBDK-dV2Fo-rr64Q8-8vAHKK-29GEg-2CLtH7-6QV2r5-5u4z4L-JkC4-5peTZ7-F3tqq-fgigu2-dKESSo-2S72xK-qyS5Wd-nbqWJM-qnHWoq-m1osdw-7yC8FT-95K2x1-dKMvt7\n* https://www.flickr.com/photos/greenputty/5250996246/\n* https://www.flickr.com/photos/warriorwoman531/5443359455/\n* https://www.flickr.com/photos/n-r-t/2635927764/\n"
  },
  {
    "path": "TacklingAlertFatigue/runbook.md",
    "content": "# Runbook Template\n\n## Table of Contents\nA Table of Contents with links to main sections\n\n## General\nA quick description of the services.  1 to 2 sentences max.  Why does this service matter?  What is it's core functionality?  What Features does it provide users?\n\n## Dashboards\nLinks to the Dashboards for this service\n\n## Alerts\nLinks to the Alerts for this service\n\nFor Every Alert there should be a corresponding section in alphabetical order\n### Alert Title\nAlert Description:  Why do we have this alert?  What does it mean?  What is typically the cause of this alert?\n\n#### Impact to Customers:\nHow does this situation impact our customers?  If the customers are not being impacted, this is a good indicator that the alert can be deleted.\n\n#### Remediation Steps:\nChecklist manifesto style steps for how to resolve this alert.  A person who has never worked on our stack should be able to follow these steps and remediate the incident.  If it cannot be remediated, include escalation steps here.\n 1. Do this\n 2. Check this graph\n 3. Do this thing \n 4. Do this other thing\n 5. Verify service has recovered\n \n## Contact Info\nTeam contact info.  Potentially contact info for who to escalate to.  What services do we have dependencies on?  How do we escalate to them?  Define this information here.  \n\n## Latest Deployments\nWe do Production Change Management Deployments via Jira, we included a link of all the latest changes here.  Recent commits, CI log etc... is incredibly helpful in understanding what code is deployed to the system, what recent changes were made.\n\n## Clusters\nInformation on where this service is deployed, and how to access those machines.\n\n## Deployment\nHow do you deploy this services.  Favor Checklist manifesto style lists here as well. \n 1. Do this thing\n 2. Do this other thing\n 3. Finally do this thing \n \n### Canary Deploy\nInstructions on how to do a Canary Deployment\n 1. Do this canary thing\n 2. 
another canary task\n \n### Rollback Deploy\nInstructions on how to Rollback a Deploy. \n 1. Get the rollback build here\n 2. Do this thing\n 3. Do this other thing.  \n\n\n\n"
  },
  {
    "path": "TheVerificationOfADistributedSystem/README.md",
    "content": "# The Verification of a Distributed System\nAccompanying Repository for The Verification of a Distributed System Talk to be given at \n* [GOTO Chicago 2016](http://gotocon.com/chicago-2016): [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system)][[Video](https://youtu.be/kDh5BrqiGhI?list=PLEx5khR4g7PIfvppVcaTPa5IKWTjoASRU)] \n* [Qcon New York 2016](https://qconnewyork.com/ny2016/presentation/verification-distributed-system) [[Slides](https://speakerdeck.com/caitiem20/qcon-newyork-2016-the-verification-of-a-distributed-system)][[Video](https://www.infoq.com/presentations/distributed-systems-verification)]\n* [YOW Melbourne 2016](http://melbourne.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-1)]\n* [YOW Brisbane 2016](http://brisbane.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-2)] \n* [YOW Sydney 2016](http://sydney.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-3)]\n\n## Abstract\nDistributed Systems are difficult to build and test for two main reasons: partial failure & asynchrony.  These two realities of distributed systems must be addressed to create a correct system, and often times the resulting systems have a high degree of complexity.  Because of this complexity, testing and verifying these systems is critically important.  
In this talk we will discuss strategies for proving a system is correct, such as formal methods, and less strenuous testing methods that can help increase our confidence that our systems are doing the right thing.\n\n## References\n* [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274)\n* Formal Specifications\n  * [Specifying Systems](http://research.microsoft.com/en-us/um/people/lamport/tla/book-02-08-08.pdf)\n  * [Use of Formal Methods at Amazon Web Services](http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf)\n  * [The Coq Proof Assistant](https://coq.inria.fr/)\n* [Simple Testing Can Prevent Most Critical Failures](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf)\n* Property Based Testing\n  * [Haskell: Quick Check](https://hackage.haskell.org/package/QuickCheck)\n  * [Erlang: Quick Check](http://www.quviq.com/products/erlang-quickcheck/)\n  * [Other Quick Check Implementations](https://en.wikipedia.org/wiki/QuickCheck)\n  * [ScalaCheck](https://www.scalacheck.org/)\n  * [29 GIFs only ScalaCheck Witches will Understand](http://nerd.kelseyinnis.com/blog/2015/01/14/29-GIFs-only-scalacheck-witches-will-understand/)\n  * [Quick Checking Riak](https://skillsmatter.com/skillscasts/4505-quickchecking-riak)\n  * [Testing Eventual Consistency in Riak](https://www.youtube.com/watch?v=x9mW54GJpG0)\n  * [Combining Model Checking and Testing](http://research.microsoft.com/pubs/200544/main.pdf)\n  * [Testing AUTOSAR Software with QuickCheck](http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=7107466&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D7107466)\n  * [Modeling Eventual Consistency Databases with QuickCheck](https://vimeo.com/23220830)\n  * [The Mysteries of Dropbox](https://vimeo.com/158002499)\n* Fault Injection\n  * [Jepsen](http://jepsen.io/)\n  * [Netflix Simian 
Army](http://techblog.netflix.com/2011/07/netflix-simian-army.html)\n  * Game Days\n    * [Resilience Engineering: Learning to Embrace Failure](https://queue.acm.org/detail.cfm?id=2371297)\n    * [Game Day Exercises at Stripe: Learning from `kill -9`](https://stripe.com/blog/game-day-exercises-at-stripe)\n* Systems Complexity Model from [Architectural Patterns of Resilient Distributed Systems](https://github.com/Randommood/YOW2016)\n* Areas of Research\n  * [Cause I'm Strong Enough: Reasoning about Consistency Choices in Distributed Systems](https://pages.lip6.fr/Marc.Shapiro/papers/CISE-POPL-2016.pdf) \n    * [The CISE Tool: Proving Weakly Consistent Applications Correct](https://hal.inria.fr/hal-01279495v1/document)\n    * [CISE Tool Demo](https://www.youtube.com/watch?v=HJjWqNDh-GA)\n    * [Github: Syncfree/CISE](https://github.com/SyncFree/CISE)\n    * [Syncfree CISE website](https://syncfree.lip6.fr/index.php/2-uncategorised/51-cise)\n  * [IronFleet: Proving Practical Distributed Systems Correct](http://research.microsoft.com/apps/pubs/default.aspx?id=255833)\n    * [Dafny](http://research.microsoft.com/en-us/projects/dafny/)\n  * Lineage-Driven Fault Injection aka Molly\n    * [Lineage-Driven Fault Injection](http://people.ucsc.edu/~palvaro/molly.pdf)\n    * [Sigmod 2015 Slides](http://www.slideshare.net/palvaro/lineagedriven-fault-injection-sigmod15)\n    * [Automated Failure Testing at Netflix](http://techblog.netflix.com/2016/01/automated-failure-testing.html)\n    * [\"Monkeys in Lab Coats\": Applied Failure Testing Research at Netflix](http://www.infoq.com/presentations/failure-test-research-netflix)\n    * [Automating Failure Testing Research at Internet Scale](https://people.ucsc.edu/~palvaro/socc16.pdf)\n    * [Orchestrated Chaos: Applying Failure Testing Research at Scale](https://www.youtube.com/watch?v=QOTNBKx9Irc)\n  * [Towards Property Based Consistency Verification](http://www.eurecom.fr/fr/publication/4874/download/ds-publi-4874.pdf)\n  * [Certified 
Causally Consistent Distributed Key Value Stores](http://people.csail.mit.edu/lesani/companion/popl16/POPL16.pdf)\n  * [Planning for Change in a Formal Verification of the Raft Consensus Protocol](https://homes.cs.washington.edu/~mernst/pubs/raft-proof-cpp2016.pdf)\n\n\n## Bio\nCaitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter.  Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO.  Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5.  She maintains a blog at [CaitieM.com](https://caitiem.com/) and frequently discusses technology on Twitter [@Caitie](https://twitter.com/caitie)\n"
  },
  {
    "path": "TheVerificationOfADistributedSystem/credits.md",
    "content": "# Image Credits\n* https://www.flickr.com/photos/allenthepostman/2701166533/\n* https://www.flickr.com/photos/ehktang/6820136333/\n* https://www.flickr.com/photos/craig21/16643183961/\n* https://www.flickr.com/photos/pachytime/3056606057/\n* https://www.flickr.com/photos/thomashawk/2727316420/in/photolist-5a1d9U-fkCzt8-forakw-fqXk3K-6GJsvL-3ewM6b-6ce9oL-d2pm5u-8Mc28D-3N8HVk-eayWoU-5EmFFT-6TWLkm-2GWc8n-4EcLbE-7FkuXb-6P9wA3-KDc9V-EYU7a-3BazNK-4TCRby-4kg9Vc-c19jP-6HDx7s-2TFPK-4HoTVj-aTcna2-b7zicc-6cFfeE-4EP5bm-5728Ar-5bT6kU-kMsUp-5FtsrE-c19jN-b4BqAt-9gb1q-55VTSc-5F9ZXB-pLN4Xm-4KXv84-2ESbDb-fERJqW-nWpqou-5X6fz7-fPsDJr-9zKXNy-nWpbLw-ftqE1w-gXdGn4/\n* https://www.flickr.com/photos/rob1501/6115982967/\n* https://www.flickr.com/photos/naathas/3319386898/\n* https://www.flickr.com/photos/pmillera4/21466941223/\n* https://www.flickr.com/photos/zionnps/6198493225/\n* https://www.flickr.com/photos/leemt2/222443032/\n* https://www.flickr.com/photos/joemar/1573687605/\n* https://www.flickr.com/photos/leemt2/203359933/\n* https://www.flickr.com/photos/doc44/8287100528/in/photolist-dCiygA-dBGHVZ-5AcVxm-dCpjhQ-dyv3Fn-ddyLx4-au6zsk-9E6zPz-dA6itL-dEen9D-dymHsJ-dSNAUn-dzc3VM-dyv39K-ddyMUL-q9jQcL-dyY7Gn-dANxgh-dBgxtL-dyPqxd-dB9VbR-dyuZZp-dzYv6R-6d9ozc-iDdQ2e-isNKTU-dyw2oF-dBb6Di-aufP4G-dyQwfu-duqS3h-dyv2ya-dwYEev-hQ8LQW-GwAi6Y-dHKhAx-7mYLLB-4BJZe5-eQVZVA-dfnVr2-dCiTd8-dxxsgY-6UX11V-dzhz9f-dziB5q-6RLFTu-dyBAg5-dyw5zB-dAGVvM-dAXoz2\n* https://www.flickr.com/photos/eepie/14433673/\n* https://www.flickr.com/photos/kookr/7056077277/\n* https://www.flickr.com/photos/omnia_mutantur/2468322428/in/photolist-4L7Ngf-bVWDFH-cSgeN1-asEyuf-axUKLP-brhFSE-cxTaWQ-aqBKGe-a9eJbP-kfksEJ-h5PkBA-asgFQ7-axjynd-FWnjB6-ddAV5G-ddASU4-9B2imo-5K73G-ddASVA-FU4Cb3-7Chmw3-Fwg8sd-FU69LQ-aym8cu-7ihaXW-F1VcMo-bVAFKn-cPz33A-FweM25-FU5YZq-FU4byb-FU4H4Q-FU4nh5-eJdVzG-apXPSV-e4jRMd-cxT8zu-5HzZ1c-ebfjTh-ddAT4q-ddATCJ-ddARUM-52fNs-hjCbb-dAz4Tu-6N8Zv-as3GF-6fnRjs-4u4QR1-ebfjYU\n* 
https://www.flickr.com/photos/44055945@N06/7224288232/\n* https://www.flickr.com/photos/vickisnature/6465872605/\n* https://www.flickr.com/photos/pinti1/15933819917/\n* https://www.flickr.com/photos/yokohamayomama/5778717345/\n* https://www.flickr.com/photos/dakiny/15079476520/\n* https://www.flickr.com/photos/dakiny/15079685280/\n* https://www.flickr.com/photos/vickisnature/4830596406/\n* https://www.flickr.com/photos/leemt2/284068408/\n"
  },
  {
    "path": "TwitterObservability/README.md",
    "content": "\n## On the Order of Billions: Building Observability at Twitter\nThis talk was given at Twitter Flight 2015.  [[Video](https://www.youtube.com/watch?v=SC6XuD1tgcQ)] [[Slides](https://speakerdeck.com/caitiem20/of-the-order-of-billions-building-observability-at-twitter)]\n\n### Abstract\nEvery minute Twitter’s Observability stack processes 1.5+ billion metrics in order to provide Visibility into Twitter’s distributed microservices architecture. In this talk will focus on some of the challenges associated with building and running this large scale distributed system. We will also focus on lessons learned and how to build services that scale that are applicable for services of any size.\n\n### Related Articles\n* [Observability at Twitter: technical overview, Part 1](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i)\n* [Observability at Twitter: technical overview, Part 2](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-ii)\n"
  }
]