There is a trend towards distributed systems made up of rapidly changing ephemeral single function microservices that can be deployed globally using public clouds. Monitoring tools have tended to focus on a small number of slowly changing monolithic applications in a single datacenter at a time, so a new breed of monitoring tools is emerging that collect data every second, are cloud and container aware, and can see the flows of requests across microservices. Testing these tools is a challenge, and this talk will describe how the SimianViz simulator can be used to model interesting architectures and create stress tests and failure scenarios.
Adrian Cockcroft has had a long career working at the leading edge of technology. He's always been fascinated by what comes next, and he writes and speaks extensively on a range of subjects. At Battery, he advises the firm and its portfolio companies about technology issues and also assists with deal sourcing and due diligence.
Before joining Battery, Adrian helped lead Netflix's migration to a large scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform. Prior to that at Netflix he managed a team working on personalization algorithms and service-oriented refactoring.
Adrian was a founding member of eBay Research Labs, developing advanced mobile applications and even building his own homebrew phone, years before iPhone and Android launched. As a distinguished engineer at Sun Microsystems he wrote the best-selling "Sun Performance and Tuning" book and was chief architect for High Performance Technical Computing.
He graduated from The City University, London with a Bsc in Applied Physics and Electronics, and was named one of the top leaders in Cloud Computing in 2011 and 2012 by SearchCloudComputing magazine. He can usually be found on Twitter @adrianco.
Follow Adrian @adrianco
Topic: The rise of distributed systems and how to make it stop.
Nicholas Weaver is Director at Intel Corporation leading a team of cloud software prototyping engineers. Previously he was the lead automation architect at VMware's vCloud Air. His passion is in finding the next evolution of the datacenter through intelligent automation. He spends his spare time smoking meats, flying his drones, and wandering the wilderness of Oregon with his family.
Follow Nicholas @lynxbat
The Internet of Things is not only exciting but has huge potential to benefit society. This talk highlights three key opportunities and associated technical challenges. First, indoor location-based services promise to have a huge impact with a variety of novel and surprising services – especially when we can get highly accurate estimates of a person's or object's indoor location. Second, software-defined infrastructure has contributed to great advancements in the cloud, data center, and other complex distributed systems. We describe how it can potentially provide even greater benefits to IoT via simplification, automation, and agility to roll out new applications and services. Third, while today's popular mobile-to-cloud architecture is highly successful, many IoT applications require an enhancement of this architecture where an additional layer (the "Fog") brings some of the cloud's capabilities to the edge of the network – including support for real-time analytics. We will highlight the technical challenges, recent advances toward their solution, and promising future directions of research.
John Apostolopoulos is the VP/CTO of the Enterprise Segment where he drives the technology and architectural direction for Cisco's efforts in the enterprise space. John also directs the Innovation Labs within CTAO whose mission is to drive technology innovations aligned with Cisco's strategic directions. This covers the broad Cisco portfolio including the Internet of Things (IoT), enterprise mobility/BYOD, software defined networking (SDN) and SDN-empowered applications such as collaboration and security, video over wired/wireless networks, and network analytics. In addition, John co-leads Cisco's next-generation Enterprise Architecture effort.
Prior to joining Cisco, John was Lab Director for the Mobile & Immersive Experience Lab at HP Labs. His work spanned novel mobile devices and sensing, client/cloud multimedia computing, multimedia networking, signal processing, immersive video conferencing, SDN, and mobile streaming media content delivery networks for all-IP (4G) wireless networks.
John received a number of honors and awards including IEEE Fellow, IEEE SPS Distinguished Lecturer, being named "one of the world's top 100 young (under 35) innovators in science and technology" (TR100) by MIT Technology Review, and a Certificate of Honor for contributing to the US Digital TV Standard (Engineering Emmy Award 1997); his work on media transcoding in the middle of a network while preserving end-to-end security (secure transcoding) was adopted in the JPSEC standard. He has published over 100 papers, five of which received best paper awards, and holds about 65 granted US patents. John also has strong collaborations with the academic community: he was a Consulting Associate Professor of EE at Stanford (2000-09) and is a frequent visiting lecturer at MIT. He received his B.S., M.S., and Ph.D. from MIT.
Follow John @john_apos
Deploy and manage the Basho Data Platform in any environment through Apache Brooklyn blueprints and the Cloudsoft Application Management Platform.
This session will show the suite of Riak and BDP blueprints available in the Apache Brooklyn ecosystem, walking participants through everything from simple deployment to complex management and multi-datacenter replication.
We'll highlight management available at deploy-time through simple blueprint configs and at runtime using sensors and effectors in the UI.
We'll also demo the latest work managing the BDP including bringing your own blueprints for data and analytics tools.
Mike Zaccardo is a software engineer at Cloudsoft and contributor to the Apache Brooklyn open source project, leading development on the Brooklyn BDP blueprints.
Follow Mike @ItsMeMikeZ
Alex Heneveld (Co-Founder & Chief Technology Officer) brings twenty years of experience designing software solutions in the enterprise, start-up, and academic sectors.
Most recently Alex was with Enigmatec Corporation where he led the development of what is now the Monterey® Middleware Platform™. Previous to that, he founded PocketWatch Systems, commercialising results from his doctoral research.
Alex holds a PhD (Informatics) and an MSc (Cognitive Science) from the University of Edinburgh and an AB (Mathematics) from Princeton University. Alex was both a USA Today Academic All-Star and a Marshall Scholar.
Follow Alex @ahtweetin
Spark is a popular new platform for interactive high performance analytics, machine learning, and data processing. The trouble is, Spark tends to monopolize whatever Mesos cluster you run it on, so you either create completely separate Spark clusters for each user, or you otherwise limit the resources each user can use.
Cook is an advanced fair-sharing, preemptive scheduling backend for Spark. You can run one instance of Cook on your Mesos cluster, and it will automatically adapt the capacity for every user and team on your cluster, so that interactive jobs run immediately but utilization remains high.
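The fair-sharing goal described above (interactive jobs run immediately while utilization stays high) can be illustrated with a classic max-min fair allocation. This is a toy sketch for intuition, not Cook's actual scheduling algorithm: users with small demands are fully satisfied, and the remaining capacity is split evenly among the rest.

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation: users with small demands are fully
    satisfied; the rest split the remaining capacity equally."""
    alloc = {}
    remaining = float(capacity)
    pending = sorted(demands.items(), key=lambda kv: kv[1])
    while pending:
        share = remaining / len(pending)
        user, demand = pending[0]
        if demand <= share:
            # This user's whole demand fits within an equal share.
            alloc[user] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Everyone left wants more than their share: split evenly.
            for user, _ in pending:
                alloc[user] = share
            return alloc
    return alloc

# One 10-unit cluster, three users: the small demands are met in
# full, and the greedy user is capped at the leftover share.
print(max_min_fair(10, {"a": 2, "b": 4, "c": 100}))
# -> {'a': 2, 'b': 4, 'c': 4.0}
```

A preemptive scheduler in this spirit would kill tasks of users running above their computed share whenever a new user's jobs arrive, which is what keeps interactive work responsive without idling the cluster.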
Cook also has a REST API and Java client, and it's written in Clojure with Datomic.
David Greenberg loves learning. He works at Two Sigma, where he is leading the charge to open source interesting and useful pieces of custom infrastructure. His desire to learn has led him to study Russian, and he enjoys practicing cooking techniques. He's currently interested in Rust, Clojure, and distributed systems with Mesos. He's the author of the book Building Applications on Mesos and the designer of Cook.
Follow David @dgrnbrg
We endeavor to build consistency, availability, and fault tolerance into our distributed systems, but how do we build them into our teams? The human factors in devops require as much attention as do our technical implementations.
Collaboration, understanding, trust: we know how important these interactions are in a devops practice, but how do we enable them between disparate team members, especially in a distributed team? My company's in California, and I'm nearly two thousand miles away in flyover country, USA. Being one of those little squares at the bottom of every video call gives me an outsider's perspective on the inside of our organizational optimization.
Drawing comparisons from theoretical computer science and practical systems implementation, I'll explore how building understanding requires a practical application of great tools in a deliberate pursuit of a constructive culture.
Bridget Kromhout is a Principal Technologist for Cloud Foundry at Pivotal. Her CS degree emphasis was in theory, but she now deals with the concrete (if 'cloud' can be considered tangible). After years in site reliability operations (most recently at DramaFever), she traded in oncall for more travel. A frequent speaker at tech conferences, she helps organize the AWS and devops meetups at home in Minneapolis, serves on the program committee for Velocity, and acts as a global core organizer for devopsdays. She podcasts at Arrested DevOps, occasionally blogs at bridgetkromhout.com, and is active in a Twitterverse near you.
Follow Bridget @bridgetkromhout
Mr. Rippert is presently IBM's General Manager, Cloud Services Strategy and Business Development.
He began his career with Arthur Andersen & Co. Management Information Consulting Division in 1981 – the company now known as Accenture. While there, Rippert held many different positions including seven years as Accenture's CTO with customer-facing responsibility for technology. Responsibilities included areas such as research and development, technical architecture, business intelligence and infrastructure.
After retiring from Accenture, Mr. Rippert became the President and CEO of Basho Technologies, a fast growing NoSQL database provider. In 2014 Mr. Rippert joined IBM's Global Technology Services (GTS) organization as part of that group's cross-IBM cloud development program.
Rippert holds two patents: one for distributed development environments and another for threat assessment using situational awareness. He is a graduate of the University of Virginia and currently resides in Northern Virginia with his wife and five sons.
Follow Don @djrippert
Writing distributed systems is hard. Operating them at scale is even harder. With Apache Mesos you've got a highly scalable and flexible kernel for a distributed OS at your disposal. By flexible I mean the capability of supporting both stateless and stateful services. A particularly interesting class of stateful services are NoSQL datastores, in particular Riak. With the recent introduction of persistent primitives in Mesos—dynamic reservations and persistent volumes—managing these sorts of workloads became much easier. There are still challenges, though, including transparent data volume migrations and explicit support for external resources, that is, enabling Mesos to manage cluster-wide resources more efficiently. We will discuss both the theoretical underpinnings and developments in this regard, as well as point out the main working areas in Mesos and the wider container community.
Benjamin Hindman is a co-founder of Mesosphere and co-creator of the Apache Mesos project. Ben was a PhD student at UC Berkeley before bringing Mesos to Twitter where it now runs on tens of thousands of machines powering Twitter's datacenters. He is now Chief Architect at Mesosphere, where they are building the Mesosphere Datacenter Operating System (DCOS). An academic at heart, his research in programming languages and distributed systems has been published in leading academic conferences.
Follow Ben @benh
Development practices for distributed systems promise the benefits of both flexibility and velocity, but realizing those benefits requires confidence. As these systems evolve, they exceed the ability of any single human to mentally model them. The future of software is complex, increasingly opaque even to the engineers building it. We can no longer rely on an architect to reason about these systems. If we can't reason about them, how can we have confidence in them?
An empirical, systems-based approach is needed to build confidence. We learn about the behavior of a distributed system by observing and experimenting on the system as it runs in production. We call this Chaos Engineering.
Chaos Engineering is a powerful practice that is already changing how software is designed and engineered at some of the largest-scale operations in the world. Where other practices address velocity and flexibility, Chaos specifically tackles systemic uncertainty in these distributed systems. The Principles of Chaos provide confidence to innovate quickly at massive scales and give customers the high quality experiences they deserve.
Casey is the Traffic and Chaos Engineering Manager at Netflix, with a mission to fortify availability in anticipation of failures. His team responds to devastating outages in stride while preserving the quality of service for customers. As an Executive Manager, Senior Architect, and Software Engineer, Casey has managed teams to tackle Big Data, architect solutions to difficult problems, and train others to do the same. He leverages experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a 'vision of the possible' to clients and colleagues alike. For fun, he models human behavior using personality profiles in Ruby, Erlang, Prolog, and Scala.
Follow Casey @caseyrosenthal
In this talk I will present Transformations, a new vision for building principled distributed systems. I will show how searching for the appropriate abstractions leads to a new approach for building and reasoning about distributed systems.
Large-scale distributed systems change and evolve very fast as a result of adding functionality to the original system, deploying new hardware in the datacenter, responding to changes in the workload, and so on. The high rate of change in their design makes distributed systems harder to reason about and implement correctly over time. Consequently, we sought out a way to build and evolve large-scale distributed systems in a principled manner, using what we call Transformations.
Transformations offer a new way to reason about a large-scale distributed system's properties such as correctness, safety, fault-tolerance and availability as the system changes and evolves over time. Transformations introduces an algebra to reason about how distributed systems evolve over time. Moreover, using Transformations developers can transform an existing large-scale system and change its properties automatically, while still being able to reason about these changes. During my talk I will cover the main contributions of Transformations, first explaining the principles behind it and then showing how it can solve very challenging problems we face as distributed systems developers: sharding different components of an already replicated system while batching some of the requests and implementing encryption between select components, or taking a system replicated with Paxos and transforming it into a system replicated using Chain Replication.
Deniz Altinbüken is a PhD candidate in Systems at Cornell University, working with Robbert van Renesse. Her interests are in distributed systems and the theory of distributed computing, with a focus on building infrastructure services for large-scale distributed systems. In addition to being a consensus enthusiast, she is currently passionate about building principled distributed systems that are easy to reason about and implement correctly.
Follow Deniz @denizaltinbuken
Time Series and Queriability
Riak TS is a new Basho product built on the core architecture of Riak. Gordon Guthrie, the Engineering Lead on Riak TS, will discuss how time series data drives unique functionality requirements, including data co-location and querying, and how they are addressed in Riak TS. He will also discuss IoT and time series data modeling, and show code examples of building tables and queries for time series use cases.
Gordon Guthrie is the Engineering Lead on Riak TS. He has been an Erlang programmer since 2003. Previous jobs include CEO/CTO at Hypernumbers, where he built an arbitrary data query language for hierarchical spreadsheets. Gordon was also Chief Technical Architect at if.com and IT strategist at RBS.
Containerization makes it easier to package and deploy any application using a unified tool chain. As organizations begin migrating, many have virtualized workloads that cannot be easily containerized, or application workloads such as static binaries and JVM applications that do not benefit from containerization. To address the growing heterogeneity of workloads, HashiCorp has developed the Nomad scheduler. Nomad is a globally aware, distributed scheduler designed to handle any type of workload on any operating system. Developers specify jobs using a high-level HCL specification, and Nomad manages placement, scheduling, auto-healing and scaling automatically. In this talk we discuss the Nomad architecture and how it can be used to handle the challenges of scheduling in a modern datacenter.
Armon (@armon) has a passion for distributed systems and their application to real world problems. He is currently the CTO of HashiCorp, where he brings distributed systems into the world of DevOps tooling. He has worked on Nomad, Vault, Terraform, Consul, and Serf at HashiCorp, and maintains the Statsite and Bloomd OSS projects as well.
Follow Armon: @armon
About David ...
Follow David: @
Since the formalization of CRDTs (Conflict-free Replicated Data Types), several systems have adopted them to design sound approaches to eventual consistency, and now operate, at scale, with low latency over global networks. Instead of focusing on simple read and write operations, CRDTs try to capture the finer semantics of data-type-specific operations, thus enabling deterministic merging of concurrent data changes.
The initial state-based CRDT designs had a considerable meta-data overhead, since they had to provide a data format that is resilient to out-of-order transmission and message duplication. In this talk, while describing a reference implementation in C++, we will show how meta-data can be greatly reduced by using causality instead of tombstones. The examples will also illustrate how composition can assemble together and re-use basic building blocks, and how data dissemination among replicas can be further optimized by shipping small deltas in place of full replica state.
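As a toy illustration of these ideas (in Python, rather than the C++ reference implementation the talk describes), here is a state-based G-Counter whose merge is a pointwise maximum (a join on the underlying semilattice), with increments returning small deltas that can be shipped in place of the full replica state:

```python
# Minimal state-based G-Counter CRDT: each replica increments its own
# entry, and states merge with a pointwise max, making merge
# commutative, associative, and idempotent.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count

    def increment(self, n=1):
        """Increment locally and return a small delta that can be
        shipped instead of the full state."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n
        return {self.replica_id: self.counts[self.replica_id]}

    def merge(self, other_counts):
        """Join with another state (or a delta): pointwise maximum."""
        for rid, c in other_counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
delta = b.increment(2)   # ship just the delta...
a.merge(delta)           # ...instead of b's full state
b.merge(a.counts)
assert a.value() == b.value() == 5
a.merge(delta)           # idempotent: duplicated messages are harmless
assert a.value() == 5
```

Richer types (sets, maps, registers) compose these same building blocks, and causality information replaces tombstones to keep the meta-data overhead down, as the talk describes.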
Carlos Baquero is an Assistant Professor and teaches Distributed Systems at Universidade do Minho, Portugal. He is a Senior Researcher at the High Assurance Laboratory within INESC-Tec. In the 90s, motivated by mobile computing and disconnected operation for file systems, he studied data types with merge (and fork) operations over semi-lattices, a precursor to state-based CRDTs. He is also interested in causality and in distributed aggregation algorithms. He especially likes all things distributed that eventually merge.
Follow Carlos: @xmal
Consistency is hard and coordination is expensive. As we move into the world of connected 'Internet of Things' style applications, or large-scale mobile applications, devices have less power, periods of limited connectivity, and operate over unreliable asynchronous networks. This poses a problem with shared state: how do we handle concurrent operations over shared state, while clients are offline, and ensure that values converge to a desirable result without making the system unavailable?
We look at a new programming model called Lasp. This programming model combines distributed convergent data structures with a dataflow execution model designed for distribution over large-scale applications. The model supports arbitrary placement of processing nodes, enabling the user to author applications that can be distributed across data centers and pushed to the edge. In this talk, we will focus on the design of the language and show a series of sample applications.
Christopher Meiklejohn is a Senior Software Engineer with Machine Zone, Inc. working on distributed systems. Previously, Christopher worked at Basho Technologies, Inc. on the distributed key-value store, Riak. In his spare time, Christopher develops a programming language for distributed computation, called Lasp. Christopher is starting his Ph.D. studies at the Université catholique de Louvain in Belgium in 2016.
Follow Christopher @cmeik
When a bug is triggered during a distributed system's execution, developers need to understand what events caused their system to arrive at the unsafe state before they can begin to isolate and fix the bug's root cause. This process of troubleshooting can be highly time-consuming, as developers spend hours poring over multigigabyte traces containing thousands of events.
I will present a tool that reduces effort spent on troubleshooting distributed systems, by automatically eliminating events from a given buggy execution that are not causally related to the bug at hand. The developer is left with a "minimal causal sequence" of triggering events, each of which is necessary for triggering the bug. We claim that the greatly reduced size of the trace makes it easier for the developer to figure out which code path contains the underlying bug, allowing them to focus their effort on the task of fixing the problematic code itself.
In the talk I will formally define the execution minimization problem we set out to solve, describe the techniques we have developed to solve it, and present the system we have built to apply those techniques in practice.
I draw examples from real bugs we have found and minimized in Raft, Spark, and Pastry. Throughout, I outline how developers can find and minimize buggy executions for their own distributed systems.
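The core minimization idea can be sketched as a greedy delta-debugging-style loop. This is a deliberate simplification (the real tool must also respect causal ordering and cope with nondeterministic replay), and the bug oracle below is a toy stand-in for actually re-executing the system:

```python
def minimize(events, triggers_bug):
    """Greedily drop events whose removal still reproduces the bug,
    leaving an approximately minimal triggering sequence."""
    result = list(events)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            candidate = result[:i] + result[i + 1:]
            if triggers_bug(candidate):   # replay without event i
                result = candidate
                changed = True
                break
    return result

# Toy oracle: the "bug" fires whenever events A and C both occur,
# with A before C. A real oracle would replay the trace.
def oracle(trace):
    return "A" in trace and "C" in trace and trace.index("A") < trace.index("C")

assert minimize(list("ABCDE"), oracle) == ["A", "C"]
```

Each event in the returned sequence is necessary in the sense that removing it (alone) no longer reproduces the bug, which is exactly the property that makes the residual trace useful for debugging.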
Colin Scott is a PhD candidate at UC Berkeley. He works on distributed systems, where difficult bugs are a fact of life. His research focuses on developing practical methods to understand and fix bugs in real systems. Colin is an NSF Graduate Research Fellow. He received his MS from UC Berkeley in 2014, and his BS from University of Washington in 2011.
Follow Colin @_colin_scott_
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems (1.5 billion events per day). This critical data needs to be stored for real-time, medium-, and long-term analysis.
Our events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a storage and analysis solution based on Riak.
First, the talk will briefly show real event examples, and how we serialize and aggregate them.
Then, Riak configuration and data modeling will be detailed, including how data is sent to Riak and read out of it.
Next will be a section that demonstrates Riak's flexibility via two real examples: how we cut the cluster's internal network usage in half, and how we used post-commit hooks to perform real-time data crunching on the cluster nodes.
Finally, the talk will present our solution using Yokozuna to build a Time Series Database on top of Riak, for near real-time ad hoc analysis of a portion of our data flow.
Damien Krotkine is a software engineer at Booking.com (the world's leading online hotel and accommodation reservation company). He currently works on the events subsystem, where he helps gather, store, manage and analyze big quantities of data in real time. Previously, he worked in various fields such as Linux distributions, e-commerce, and online real-time advertising. He's an active member of the Perl community, maintaining several NoSQL-related modules (Redis driver, Riak client, Bloomd client...)
Follow Damien @damsieboy
As ShopKeep grows we must empower our developers to build and scale to meet the needs of the business. This talk looks at the genesis of ShopKeep, and how Riak has helped power our growth across the globe. This talk also has some dos and don'ts on building an actual service-based distributed system leveraging some awesome new technologies.
Duncan Grazier is the VP of Engineering at ShopKeep. They are building the future of point of sale. His responsibilities include scaling the teams, processes, and software that support the future of point of sale.
Follow Duncan @itsmeduncan
Panel Discussion with Heather McKelvey, Matt Digan, Cuyler Jones and James Gorlick about the Basho Data Platform.
Heather McKelvey has more than 20 years of experience leading and managing enterprise software teams. Most recently, she was the CTO and SVP of engineering at GoGrid, where she led architecture and development of cloud services, inclusive of compute, SDN, storage and orchestration services. Prior to that, McKelvey was VP of engineering and operations at Mashery, an API management SaaS running in the cloud. Before joining Mashery, McKelvey held senior positions at MarketLive, Netscape, Compaq and Ingres. McKelvey works closely with the executive, product and engineering teams to deliver new services and capabilities that will further expand the attractiveness of Riak to companies seeking to harness NoSQL technologies for their unstructured data needs.
Basho delivers highly available, scalable distributed systems that are easy to operate at enterprise scale, empowering applications to store and retrieve unstructured data and easily manage active data workloads.
Founded in 2008, Basho was one of the first distributed systems companies dedicated to developing disruptive technology to simplify the most critical data management challenges for enterprises.
It currently has one of the largest and most talented groups of engineers and technical experts ever assembled, devoted exclusively to solving some of the most complex issues presented by scaling distributed systems.
Basho also enjoys a large and growing following among influential programmers, architects and academics, and is the organizer of RICON, a distributed systems conference that brings together many of today's leading luminaries in the expanding universe of enlightened distributed database management.
Follow Basho @basho
Panel Discussion on CRDTs
Discussion with Riot Games
Panel Discussion with Desiree Van Welsum (OECD), Jonathan Murray
Sponsor Talks - reserved for HP
Panel discussion with Alex Williams (The New Stack), Dave McCrory (Basho Technologies), Nicholas Weaver (Intel), Mac Devine (IBM), Duncan Johnston Watt (Cloudsoft), Donnie Berkholz (451 Research).
Alex Williams is founder and editor in chief of The New Stack. He's a longtime technology journalist who did stints at TechCrunch, SiliconAngle and what is now known as ReadWrite. Alex has been a journalist since the late 1980s, starting at the Augusta Chronicle in 1989 after completing his master's degree from Northwestern University's Medill School of Journalism. Early in his career, he reported for newspapers in New York and Oregon, worked for a magazine writing about home textiles (ask him about it some time) and spent a year as a television business news anchor. Alex's online career began in 2003 when he did a web event called RSS WinterFest, which was followed by Podcast Hotel, an event all about the intersection of art and commerce and the impact digital media has on independent culture. While in college, Alex played baseball in France, which led him to writing stories of his experiences, and eventually a career in journalism.
Follow Alex @alexwilliams
Sqor Sports is a globally distributed social network built using Erlang and Riak. Our customers are some of the largest sports teams and biggest athletes on earth. In this session, I talk about some of the challenges we have tackled using Riak at our scale.
Technical Audience Level: Both advanced and beginner welcome
Adaptable technical leader, entrepreneur, and software developer/architect/engineer with over 20 years of experience in leadership and engineering, including P&L responsibility. I am also a data-driven leader who is comfortable using mathematical modeling to solve complex problems.
Currently CTO of Sqor Sports, where I built the founding technology team from 3 people to over 40, while building from scratch a social network that serves the largest teams on earth. I have written code in most parts of the stack, but mostly focused on data science/machine learning and DevOps, and have used Erlang from day 1. Additionally, I function as General Manager (wearing all hats at the startup), with duties including sales/cash-flow generation, P&L, and general business operations.
Follow Noah @noahgift
In this talk I plan to discuss the design and implementation of geo-replicated data stores that go beyond offering weak consistency semantics, and instead offer causal consistency guarantees (with eventual convergence guarantees, often called Causal+ Consistency), or go beyond operations over single objects in the data store to offer transactions that manipulate multiple objects with isolation guarantees. The talk will be divided into three main sections.
In the first section I will cover the motivation for providing stronger semantics than weak consistency on data stores, illustrating the need for either causal consistency or transactional semantics with some simple examples. The talk will briefly cover the design and implementation of several existing solutions, briefly addressing how they differ across the design space. From this quick overview, the talk will then shift focus to two concrete solutions proposed by academia, in whose design and implementation I have participated.
In the second section of the talk, I will focus on the design and implementation of ChainReaction, a geo-replicated data store that offers causal+ consistency. ChainReaction was implemented on top of the Fast Array of Wimpy Nodes (FAWN) datastore, which features a one-hop DHT (similar to Riak's). I will discuss the variant of Chain Replication employed at the core of ChainReaction, which trades write latency for fast causally consistent reads, and explain how this design is extended to a geo-replicated scenario.
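For readers unfamiliar with the baseline scheme, here is a minimal sketch of plain Chain Replication (not the ChainReaction variant itself): writes enter at the head and propagate down the chain, while reads are served by the tail, which only holds fully replicated writes and so never returns uncommitted data.

```python
# Toy Chain Replication: a linear chain of nodes where writes flow
# head -> tail and reads are answered by the tail.

class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None  # successor in the chain

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)  # propagate toward the tail

class Chain:
    def __init__(self, length=3):
        self.nodes = [ChainNode() for _ in range(length)]
        for a, b in zip(self.nodes, self.nodes[1:]):
            a.next = b
        self.head, self.tail = self.nodes[0], self.nodes[-1]

    def put(self, key, value):
        self.head.write(key, value)   # a write is acknowledged only
                                      # once it reaches the tail

    def get(self, key):
        return self.tail.store.get(key)  # tail sees only committed writes

chain = Chain(length=3)
chain.put("k", "v1")
assert chain.get("k") == "v1"
assert all(n.store["k"] == "v1" for n in chain.nodes)
```

The trade-off the talk highlights is visible even here: a write pays latency proportional to the chain length, which is what ChainReaction rebalances in exchange for fast causally consistent reads.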
In the final section of the talk I will address the design and implementation of Blotter, a geo-replicated data store that supports arbitrary transactions with an isolation level of non-monotonic snapshot isolation. At the core of Blotter we leverage a special configuration of Paxos, which allows committing a transaction in a single round-trip among the closest datacenters.
João Leitão has a PhD in Computer Engineering (2012) from the Technical Institute of Lisbon; he also holds a Master's degree (2007) and an undergraduate degree in Computer Engineering from the Faculty of Sciences of the University of Lisbon. He is an Integrated Researcher in the NOVA LINCS laboratory of the NOVA University of Lisbon and an Invited Assistant Professor in the Computer Science department of the Faculdade de Ciências e Tecnologia of the same university.
His research interests are mostly focused on the design and implementation of large-scale systems ranging from cloud-based geo-replicated to peer-to-peer infrastructures, with emphasis on questions related to large-scale, consistency, fault-tolerance/reliability, efficiency, and security.
He is a professional member of the IEEE and ACM. He has two cats and occasionally (when time allows) practices Iaido and Jodo.
Follow João @jcaleitao
The journey of how a Sensor manufacturer went from a pc client connected to Postgres, to a online system with Postgres to a online system with riak.
The system we built is an online management and reporting tool for measuring data. There are thousands of online sensors all connecting at irregular times pushing data up and pulling configuration down. Clients and sensors are distributed over the globe requiring an always on system.
How do you use the strengths of Riak and cope with its limits, both from a developer standpoint and an operational one? How did we design the system to take full advantage of a distributed database as a core part of the system?
How do you break out of your RDBMS shell and start thinking in NoSQL terms: denormalization, time slicing, and other deterministic segmentations?
I run a small consulting company in Sweden, which I have been running since 2010. I have an M.Sc. in electrical engineering with a focus on signals and systems.
Our main focus is to help build and operate big and complex systems.
My main focus when building systems has always been scalability, availability, and manageability. Riak has been my number-one choice whenever NoSQL is an option, ever since I came across it in 2012.
Follow Johan @s2hc
Please join me as we explore some of the important characteristics of a reliable and efficient distributed system, such as scalability, availability, performance, latency, and fault tolerance, and learn how those same attributes can and should be implemented across all areas of your IT infrastructure.
Kevin Jones is an Engineer with NGINX.
Follow Kevin @webopsx
Does it bother you when all your microservice logs show up out of order in your log analysis tool because your system clocks aren't synchronized? Did you ever wonder what time it was when the dotted version vector on a piece of data became 2.4.5? If so, then this is the talk for you!
Time is a notoriously difficult concept in distributed systems. Perfect clock synchronization is impossible, and most logical timestamp schemes (like Lamport clocks) don't bear a relationship to "wall clock time" that is meaningful to human operators.
In this talk, we'll learn about distributed monotonic clocks (DMC) whose timestamps can reflect causality but which have a component that stays close to wall clock time. This scheme builds on previous hybrid logical clock proposals by adding important operational hooks and by building in mechanisms to prevent a single runaway system clock from dragging a whole cluster's logical clocks forward into the future.
DMCs can be implemented as a "piggyback protocol" suitable for transport in existing application messages (for example, as additional HTTP headers). I'll describe a novel coordination protocol in this unusual piggyback style, a style that may also be useful in ad-hoc clusters formed by Internet-attached "things".
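The hybrid logical clock idea that DMCs build on can be sketched in a few lines. The following is a minimal illustration of the underlying HLC scheme only, not the DMC protocol itself; the class and method names are hypothetical:

```python
import time

class HybridLogicalClock:
    """Minimal hybrid logical clock: timestamps stay close to wall-clock
    time, yet still reflect causality across messages."""
    def __init__(self, now=time.time):
        self.now = now   # wall-clock source, in seconds
        self.l = 0       # largest physical time observed, in milliseconds
        self.c = 0       # logical counter breaking ties within one millisecond

    def send(self):
        """Produce a timestamp for a local event or an outgoing message."""
        pt = int(self.now() * 1000)
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, remote):
        """Merge a timestamp piggybacked on an incoming message."""
        rl, rc = remote
        pt = int(self.now() * 1000)
        m = max(self.l, rl, pt)
        if m == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif m == self.l:
            self.c += 1
        elif m == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = m
        return (self.l, self.c)
```

In the piggyback style described above, the `(l, c)` pair could travel as an extra header on each request, e.g. a hypothetical `X-HLC` header; comparing the tuples lexicographically orders events consistently with causality.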
Jon Moore is a Senior Fellow at Comcast Cable, where he leads the Core Application Platforms group that focuses on building scalable, performant, robust software components for the company's varied software product development groups. His current interests include distributed systems, hypermedia APIs, fault tolerance, and Texas Hold'em. Jon received his Ph.D. in Computer and Information Science from the University of Pennsylvania and currently resides in West Philadelphia, although he was neither born there nor raised there and does not spend many of his days on playgrounds.
Follow Jon @jon_moore
Modern NoSQL systems have made it significantly easier to build complex, constantly evolving applications. These systems' support for data models that allow nested data and dynamic schemas lets developers quickly prototype and deploy new features. While their support for flexible data models is a boon for developers, several NoSQL systems have eschewed one of the main programmer-friendly abstractions of traditional DBMSs: atomic transactions. Atomicity guarantees that a group of writes to multiple data items is performed in an all-or-nothing fashion. Atomicity frees application developers from writing error-prone corner-case code to deal with scenarios in which only a subset of a transaction's writes succeeds.
The reason that these systems eschew atomic transactions is their seemingly prohibitive performance cost, particularly in distributed settings. In a distributed database system, atomicity requires all the machines involved in a given transaction to coordinate with each other. In particular, if a transaction reads or writes data residing on several machines, then atomicity requires each such machine to coordinate with every other machine involved in the transaction. To circumvent this distributed coordination, several popular NoSQL systems give up on general atomic transactions, and instead restrict their scope to a single key or partition.
This talk takes a step back and asks: does distributed coordination necessarily preclude performant general atomic transactions? The answer is a resounding NO. This talk analyzes the distributed coordination necessary for general atomic transactions, and describes its impact on three key performance and correctness properties: fairness, isolation, and throughput. We will find that there exists a three-way tradeoff between fairness, isolation, and throughput (FIT): a system which supports general atomic transactions can achieve at most two of these three properties simultaneously. Database architects can use the FIT tradeoff to reason in a principled fashion about the performance and correctness tradeoffs associated with general atomic transactions.
Jose Faleiro is a fourth year PhD student at Yale University, where he is advised by Daniel Abadi. His research interests lie in concurrency control and recovery for main-memory multi-core database systems. More broadly, he is interested in dealing with concurrency in parallel and distributed systems.
Follow Jose @jmfaleiro
We will start with an empty git repository and cover all the steps, technologies and concepts needed to build a riak_core based backend.
The topics covered will include:
Creator and maintainer of efene (a programming language for the Erlang VM), and co-founder of Event Fabric, an application to integrate, manipulate, understand, and visualize data from multiple sources in real time from your web browser.
Follow Luis Mariano @warianoguerra
Manav is a product management executive with several years of experience in delivering consumer and enterprise product and developer platforms for mobile, desktop and cloud-based environments.
At HP, Manav heads the product management team responsible for envisioning, developing and delivering the Helion Development Platform for today's enterprise. Helion Development Platform, built on Cloud Foundry, OpenStack and Docker, is a modern, open application development platform for building and deploying cloud-native apps.
Follow Manav Mishra @manavm
Spend a session reflecting on the foundational papers and people behind distributed systems in a fun and quick moving context. We will take a whirlwind tour through the ideas embodied in the classic papers that form the basis for distributed systems. You may be surprised, hopefully you will be delighted and you will leave with a great bibliography for further inquiry and professional biographies of the people behind the papers.
Mark has been developing software for over 20 years. For the past several years, he has been building distributed systems in Erlang, most recently at Basho. Mark has spoken at several conferences including Erlang Factory, RICON, OSCON and Strange Loop.
Follow Mark @bytemeorg
We all know the scenario: what was once an experiment soon becomes a desired product. Suddenly it's time to scale the system, and all the time we hear about running "at scale", but what does that really mean? This talk is about the challenges OpenX faces when scaling a small proof-of-concept to a complex global product. It focuses on our strategies for planning, adapting, and growing distributed systems into fully productionized services. Operational best practices factor in strongly: statistics gathering, careful capacity planning, and precise tooling help pull off migrations and upgrades with little impact. A close relationship with Engineering and working in a constant feedback loop with developers is another part of the puzzle, helping the entire team launch successfully. Through these examples we'll attempt to demystify distributed systems, move beyond those basic five-node clusters, and find how the approach to scaling something like Riak is equally applicable to many other kinds of highly available clusters.
Matt Davis is a Sr. Site Reliability engineer in Platform Engineering at OpenX. As an expert on running some of the largest Riak clusters on the planet, he has given talks about operating distributed platforms at several venues including hackerspace meetings, RICON and SCaLE. Known to build electro-acoustic instruments for musical improvisation, he likes to combine art and music philosophies into his approaches for managing complex computer systems.
Follow Matt @dtauvdiodr
We present Hyperbahn, a service discovery and routing system developed at Uber. Hyperbahn implements some novel features such as end-to-end rate limiting, circuit breaking, and backup requests with cross-server cancellation. This talk will cover Hyperbahn and how we do RPCs at Uber.
Matt is a Senior Staff Engineer at Uber where he works on architecture and distributed systems. Before that, he was co-founder of Voxer.
Follow Matt @mranney
Many newer data stores, distributed ones in particular, carry a great deal of promise. However, many bright-eyed engineers find themselves struggling just to keep the darn thing up instead of experiencing the operational nirvana described in the documentation and marketing materials. What gives?
The reality is that even the "simplest" data stores are complex beasts, and the way they fit into YOUR infrastructure and applications makes them even more so. Learning to be truly "at peace" with even a single database technology can take years. Achieving moderate proficiency and competency usually takes less time, but the act of learning itself can be treated as a skill and honed.
In this talk we'll talk about just that - how to get good at getting to know a new system, and how to understand its limitations BEFORE your production workload is running up against them and falling down.
No one wants to spend their whole life just trying out different databases, but the perpetual tire fire that often results from a poorly chosen one is no way to live either. Let's make better decisions, together!
Mike is the CTO of Opsmatic. He works with his team on building a reliable service for making the daily lives of frontline operators and engineers better. He came to this gig after experiencing much pain during his own daily life, building and operating a few large-scale infrastructures at Urban Airship, SimpleGeo, Flickr, and Yahoo!
Follow Mikhail @mihasya
Today's NoSQL databases provide the perfect framework for big data applications with the ability to scale and use more resources for both storage and compute. They also provide WAN replication capabilities and SQL support. One challenge that still remains in NoSQL databases is the ability to perform distributed transactions.
Distributed transactions are an essential need for finance, e-commerce, telco, and other industry applications. One use case for distributed transactions is a money transfer between two accounts stored in two separate partitions. The challenge is to lock both partitions for the transaction while still providing high throughput, reliable data, and high availability. In this session, we discuss ways to achieve distributed transactions on big data frameworks.
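The classic way to make such a transfer atomic across two partitions is a coordination round like two-phase commit: each partition first votes on whether it can apply its write, and the transfer commits only if both vote yes. A toy sketch with hypothetical names, ignoring coordinator failures:

```python
class Partition:
    """A toy partition holding account balances, with prepare/commit voting."""
    def __init__(self, balances):
        self.balances = balances
        self.pending = None

    def prepare(self, account, delta):
        # Phase 1: vote yes only if the write can be applied safely
        # (here: the balance would not go negative).
        if self.balances.get(account, 0) + delta < 0:
            return False
        self.pending = (account, delta)  # "lock" the partition for this txn
        return True

    def commit(self):
        # Phase 2: apply the prepared write.
        account, delta = self.pending
        self.balances[account] += delta
        self.pending = None

    def abort(self):
        self.pending = None

def transfer(src, dst, src_acct, dst_acct, amount):
    """Coordinator: both partitions prepare, then both commit, or neither."""
    if src.prepare(src_acct, -amount) and dst.prepare(dst_acct, amount):
        src.commit()
        dst.commit()
        return True
    src.abort()
    dst.abort()
    return False
```

The cost is visible even in the sketch: both partitions stay locked for the duration of the round trip, which is exactly the throughput and availability tension the session discusses.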
DeWayne Filppi is a software technologist with broad and deep industry experience, ranging from pre-sales engineering, post-sales consulting, through product design, development, architecture, and management. He mainly focuses on high performance server platforms and cloud computing.
Follow DeWayne @dfilppi
The Internet was not designed to support 50 Billion devices fanning-in Brontobytes of mostly low value updates. And yet every major vendor, analyst and grandmother will tell you that the Internet of Things (IoT) is on the cusp of doing just that. Smart organizations are optimizing traffic, delivering insights and enforcing security closer to the devices critical to their business with distributed IoT Platforms. Distributing insight and security in IoT has several unique challenges from identifying node device shadows to maintaining decision context to the resource limitations of many edge devices. We'll examine how to address these challenges with compose-able nodes, context priority and audit-able IoT framework components.
Dr. Sarah Cooper is M2Mi's COO responsible for engineering, business development and platform strategy. M2Mi, founded in 2006 to make a Connected World a reality, has a distributed, compose-able IoT Platform consisting of 7 modules spanning Monetization, Connection Management, Device Control, In-stream Analytics and Security/Privacy. Sarah is a 2015 Silicon Valley Woman of Influence, a recognized Top 100 Wireless Technology Expert by Wireless World, one of Connected World's Women of M2M, a member of OASIS MQTT TC and SmartGrid Interoperability Panel (SGIP). She holds multiple patents and a PhD in Physics from the University of Sydney.
Follow Sarah @SMC_on_IoT
Sarah Novotny is ... TBA
Follow Sarah @sarahnovotny
Chain replication, a variation of primary-backup replication, is an increasingly common technique for maintaining strong consistency semantics within a data store. To date, more research effort has gone into formalizing the data replication technique than into formalizing the management of its metadata: i.e., what decides the chain's order, and when the chain order may change without violating consistency constraints.
This talk introduces Humming Consensus, a new technique for managing chain replication metadata. This metadata defines cluster membership, server order within the chain, and changes to chain order in response to peer failure. All participants in Humming Consensus are equal peers that operate without external assistance from ZooKeeper or other coordination service. Humming Consensus may also be used to manage chain replication-based eventual consistency data stores. The talk will also briefly introduce Basho's new distributed file store, Machi, which relies upon Humming Consensus to operate either in strongly consistent or eventually consistent environments. Attendees will need no prior knowledge of Erlang: the talk will focus on Machi's design and testing methods rather than its implementation internals.
Scott Lystig Fritchie met his first UNIX system in 1986 and has almost never met one since that he didn't like. A career detour as a UNIX systems administrator got him neck-deep in messaging systems, e-mail, and Usenet News. He rediscovered full-time programming while at Sendmail, Inc. and has been writing code for, designing, and testing distributed systems ever since.
Scott is a senior software engineer at Basho Japan KK.
Follow Scott @slfritchie
Would you believe that Riak 2.0 has everything you need to build a distributed queue? That was the question a few engineers set out to answer, after a night out of drinks and disagreement.
In this presentation, we'll talk about an in-house distributed queue we built in Golang, called Dynamiq. We'll cover why we chose Riak and Golang, the design behind the system, and most importantly how it works and how we put it into production. We'll cover the clever ways you can use Riak to do "the hard parts" of a distributed workload, and build a system on top of it that extends its core functionality in new ways.
I've been a software engineer for over 10 years, working on energy management systems, web analytics, health monitoring software, and finally landing my dream job: putting ads on phones. At Tapjoy we're constantly pushing Riak to its limits, and are using it in a number of ways to power our infrastructure.
I'm the author and maintainer of several Ruby and Golang projects of incredibly dubious value, and enjoy traveling, drinking hoppy IPAs, and eating fine cured meats with my wife. At some point in my life, I picked up the nickname Stabby, although I swear I'm a very live-and-let-live kind of person.
Follow Sean @StabbyCutyou
In recent years, researchers and practitioners have been trying to ensure application correctness while providing low-latency and highly available operations in geo-replicated systems. Despite these continuous efforts, it is well known that many types of operations must rely on cross-replica coordination to ensure that application invariants are preserved at all times. This is a fundamental limitation on improving services, as many companies cannot afford the extra latency to ensure consistency across replicas, nor the reduced availability during faults, as this directly translates into revenue losses. In this talk, we discuss how to reduce cross-replica coordination while still preserving correctness properties by exploiting application semantics.
The talk is divided into two parts. In the first part, we show how to preserve global numeric invariants on top of an existing production KV-store, while moving most coordination outside the critical path of operation execution. We achieve this using a new replicated data type to maintain the necessary information, and show experimental results for a proof-of-concept prototype.
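A replicated data type of this kind resembles an escrow-style bounded counter: each replica holds a local allowance of "rights" to decrement, so a global invariant (say, a stock count never going negative) holds without coordinating each individual operation. A simplified single-process sketch, not the talk's actual implementation:

```python
class BoundedCounter:
    """Escrow-style counter: the invariant value >= 0 holds without
    per-operation coordination, because each replica can only spend
    rights it locally owns. Moving rights between replicas requires a
    message, but that coordination sits off the critical path."""
    def __init__(self, replica_id, initial_rights=0):
        self.id = replica_id
        self.rights = {replica_id: initial_rights}  # rights held per replica

    def decrement(self, n=1):
        # Safe without coordination: spends only locally owned rights.
        if self.rights.get(self.id, 0) < n:
            return False  # would risk violating the invariant; refuse
        self.rights[self.id] -= n
        return True

    def transfer(self, to_replica, n):
        # Moves rights to another replica: this is the coordination step.
        if self.rights.get(self.id, 0) < n:
            return False
        self.rights[self.id] -= n
        self.rights[to_replica] = self.rights.get(to_replica, 0) + n
        return True

    def value(self):
        # Global value is the sum of all outstanding rights.
        return sum(self.rights.values())
```

A replica that runs out of rights can refuse locally (as above) or request a transfer from a peer, keeping the expensive round trip out of the common case.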
In the second part, we explore the more ambitious vision of providing generic application invariants with virtually no coordination. The fundamental insight is to identify and prevent the execution of concurrent operations, at different sites, that would result in invariant violations when the effects of the operations are merged together. We combine a static analysis tool and an online concurrency control mechanism to avoid unsafe executions.
Valter Balegas is a third-year PhD student at Universidade Nova de Lisboa, working under the supervision of Nuno Preguiça, one of the creators of CRDTs. The focus of his PhD is improving correctness properties in geo-replicated systems without diminishing their availability and latency. He has recently proposed Explicit Consistency, a new consistency model that allows efficient implementation of applications that preserve invariants on top of weak consistency models. In the past he interned at Basho, working on CRDTs.
Research interests include Distributed Systems, Scalable Data Stores, Geo-Replication and home brewing.
Follow Valter @vbalegas
This one-day course will provide the conceptual overview of topics necessary to understanding application development on Riak, a discussion of advanced developer topics such as access patterns, querying and data modeling best practices, as well as hands-on experience in developing an API which uses Riak as a high-throughput scalable data store.
Developers will be expected to arrive with the following items prepared:
This section will introduce developers to the vocabulary and concepts used in Riak application development. The potential list of topics (the specific list will be tailored to the experience level of the participants) includes:
Developers will install Riak and relevant client libraries. For developer convenience, pre-provisioned VirtualBox VMs will be provided (on USB drives).
Participants will be led through a hands-on dev lab, which will illustrate the access patterns, techniques and best practices necessary for developing real-world applications on Riak, in the language of their choice (see the Pre-Requisites section for a list of supported languages). Sample code and libraries will be provided for developer convenience.
As time permits.
Get a demonstration and hands-on experience with the tools, skills, and techniques needed to administer a Riak cluster. Handle real life situations including configuring multiple back-ends, upgrading while in use and adding monitoring to the cluster.
Participants in Ops Training will be expected to arrive with the following items prepared: