tldr: Parameter Server
Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

Scaling Distributed Machine Learning with the Parameter Server [2014], Li et al.

Introduction

The motivation for this paper is efficiently solving large-scale machine learning problems by distributing the work over many worker nodes. Fault tolerance is also a major goal: training tasks often run in the cloud, which means dealing with unreliable machines.

Before getting into parameter servers, here's a quick primer on machine learning as it relates to systems. The goal of ML can be thought of as finding a model, which is just a function approximation. For example, the function could be f(user profile), equal to the likelihood distribution of that user clicking on an ad. Machine learning happens in two parts:

1. Training. Exposing the model to training data so the model can improve. This is an iterative process (see the sketch after this list).
2. Inference. Testing the trained model on new data to make predictions.
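To make the training/inference split concrete, here's a minimal sketch (mine, not from the paper): a one-feature logistic model standing in for f(user profile), trained iteratively by gradient descent and then queried at inference time. The feature, synthetic data, and hyperparameters are all made up for illustration.

```python
# Toy sketch of the training/inference split, not the paper's system:
# fit f(x) = sigmoid(w * x + b), mapping a single user feature to a
# click probability, then apply the trained model to new inputs.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=100, lr=0.1):
    """Iteratively improve (w, b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):            # training is an iterative process
        for x, clicked in data:
            p = sigmoid(w * x + b)     # model's current prediction
            grad = p - clicked         # gradient of log loss wrt the logit
            w -= lr * grad * x         # nudge the parameters a little
            b -= lr * grad             # on every example
    return w, b

def infer(w, b, x):
    """Inference: apply the trained model to a new input."""
    return sigmoid(w * x + b)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: one made-up feature (say, hours online per day);
    # heavier users are more likely to click.
    data = [(x, 1 if random.random() < sigmoid(x - 2.0) else 0)
            for x in (random.uniform(0, 4) for _ in range(500))]
    w, b = train(data)
    print(f"P(click | x=3.5) ~ {infer(w, b, 3.5):.2f}")
    print(f"P(click | x=0.5) ~ {infer(w, b, 0.5):.2f}")
```

In the distributed setting the paper targets, it's exactly the parameters (w, b here, but billions of values in practice) and their updates that get sharded across parameter-server nodes, while workers compute the gradients.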