The main component of profefe, a system for continuous profiling, is profefe-collector — a service that receives profiling data, taken from an application and persists the data in collector’s storage (design document describes it in more details). Receiving data from an external source (for example, profefe-agent), indicates that profefe, as an observability system, prefers “push” model. Why not “pulling” the profiles directly from the application?
Both push and pull models have their benefits and drawbacks.
A collector that pulls profiling data from running applications could simplify integration into existing infrastructure because there would be no need in making changes in the applications that already exposed pprof HTTP endpoint. Making sure that every application integrated and configured profefe-agent would be a challenging job in a large organisation.
On the other hand, pull model requires pprof servers to be exposed and available for the collector, so it could fetch (pull) profiling data. That can also be challenging in the deployments, where applications are collocated on the bare-metal machines. Every application (application’s instance) would have to communicate a unique TCP port for its pprof server.
To work as a pull-system, the collector must be able to discover the pprof servers, thus it requires a mechanism for service discovery (SD), to be usable at scale. Unfortunately, there isn’t a universal SD protocol or a provider, an observability system could be built upon.
Prometheus, the best example of an open-source system, which uses pull model for data collection, have to support several different SD systems in their code base. At some point they ended up introducing their own general protocol, that expects a “middle-man-service”, which translates the data from a SD system into a list of Prometheus targets (Update, this comment from u/bbrazil does a better job explaining the state of SD in Prometheus). There is no clear way for an open-source system to be both flexible and don’t end up being a pile of “plugins”, that no one is willing to maintain or break.
From the start of profefe project, several years ago, I had the idea that translating push into pull would be easier for an end-user. That is if a small deployment already exposes a pprof server, writing an external job that pulls the profiles from the applications, annotates them with meta-data, and pushes the data into the collector, can be as easy as spawning a cronjob in a sidecar. kube-profefe solves that nicely for deployments running in Kubernetes. At some point, I hoped to come up with something similar for Nomad+Consul if the experiments ended successfully.
Translating pull into push is a similarly possible but because profefe didn’t have to support any SD mechanisms from the start, that simplified the overall code base and allowed us to focus on the collector and the API for profiles quering.
profefe-collector does uses push model. But one can deploy profefe so it reflected the use cases your organisation has.
Do you use continuous profiling? Let me know about your experience. Share your thoughts on Twitter or discuss on r/golang.