Logarithm: A logging engine for AI training workflows and services

  • Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
  • Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs.
  • In this post, we present the design behind Logarithm, and show how it powers AI training debugging use cases.

Logarithm indexes 100+GB/s of logs in real time and serves thousands of queries a second. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Users can emit logs using their choice of logging library (the common library at Meta is the Google Logging Library [glog]). Users can query using regular expressions on log lines, arbitrary metadata fields attached to logs, and across log files of hosts and services.

Logarithm is written in C++20 and the codebase follows modern C++ patterns, including coroutines and async execution. This has supported both performance and maintainability, and helped the team move fast, developing Logarithm in just three years.

Logarithm’s data model

Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured text, akin to a single log file. A process can emit multiple log streams (stdout, stderr, and custom log files). Each log line can have zero or more metadata key-value pairs attached to it. A typical example of metadata is rank ID in machine learning (ML) training, when multiple sequences of log lines are multiplexed into a single log stream (e.g., in PyTorch).

Logarithm supports typed structures in two ways: via typed APIs (ints, floats, and strings), and via extraction from a log line using regex-based parse-and-extract rules (a typical example is metrics of tensors in ML model logging). The extracted key-value pairs are added to the log line’s metadata.
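
As an illustration of the parse-and-extract path, here is a minimal Python sketch of how a regex-based extraction rule might turn matched fields into metadata key-value pairs. The rule format and function name are assumptions for exposition, not Logarithm’s actual API:

```python
import re

def apply_extraction_rules(line, rules):
    """Apply regex-based parse-and-extract rules to one log line.

    Each rule is a compiled regex with named groups; every named group
    that matches becomes a metadata key-value pair on the line.
    (Illustrative sketch; not Logarithm's real rule format.)
    """
    metadata = {}
    for rule in rules:
        m = rule.search(line)
        if m:
            metadata.update(
                {k: v for k, v in m.groupdict().items() if v is not None}
            )
    return metadata

# Hypothetical rule: extract tensor metrics logged by an ML job.
rules = [re.compile(r"loss=(?P<loss>[0-9.]+) grad_norm=(?P<grad_norm>[0-9.]+)")]
meta = apply_extraction_rules("iter=120 loss=0.42 grad_norm=1.7", rules)
# meta == {"loss": "0.42", "grad_norm": "1.7"}
```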

Figure 1: Logarithm data model. The boxes on text represent typed structures.

AI training debugging with Logarithm

Before looking at Logarithm’s internals, we present its support for training systems and model-issue debugging, one of the prominent use cases of Logarithm at Meta. ML model training workflows tend to have a wide range of failure modes, spanning data inputs, model code and hyperparameters, and systems components (e.g., PyTorch, data readers, checkpointing, framework code, and hardware). Further, failure root causes evolve over time faster than in traditional service architectures, due to rapidly evolving workloads, from scale to architectures to sharding and optimizations. In order to triage such a dynamic set of failures, it is critical to collect detailed systems and model telemetry.

Since training jobs run for extended periods of time, training systems and model telemetry and state need to be captured continuously, in order to be able to debug a failure without reproducing it with additional logging (which may not be deterministic and wastes GPU resources).

Given the scale of training jobs, systems and model telemetry tend to be detailed and very high-throughput: logs are relatively cheap to write (e.g., compared to metrics, relational tables, and traces) and have the information content to power debugging use cases.

We stream, index, and query high-throughput logs from the systems and model layers using Logarithm.

Logarithm ingests both systems logs from the training stack and model telemetry from the training jobs that the stack executes. In our setup, each host runs multiple PyTorch ranks (processes), one per GPU, and the processes write their output streams to a single log file. Debugging distributed job failures leads to ambiguity due to the lack of rank information in log lines, and adding it would mean changing all logging sites (including third-party code). With the Logarithm metadata API, process context such as rank ID is attached to every log line; the API adds it into thread-local context and attaches a glog handler.
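
A minimal sketch of that idea, using Python’s stdlib `logging` in place of glog: the context is stored once in thread-local state, and a filter attaches it to every record, so individual logging call sites (including third-party code) need no changes. The names and structure here are illustrative, not Logarithm’s API:

```python
import logging
import threading

_context = threading.local()

def set_rank(rank):
    # Store process context (e.g., the PyTorch rank) once per thread.
    _context.rank = rank

class RankMetadataFilter(logging.Filter):
    """Attach the thread-local rank to every record, analogous to the
    metadata API attaching a glog handler (sketch, not the real API)."""
    def filter(self, record):
        record.rank = getattr(_context, "rank", None)
        return True

logger = logging.getLogger("trainer")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[rank=%(rank)s] %(message)s"))
handler.addFilter(RankMetadataFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

set_rank(3)
logger.info("starting batch 42")  # emits: [rank=3] starting batch 42
```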

We added UI tools to enable common log-based interactive debugging primitives. The following figures show screenshots of two such features (on top of Logarithm’s filtering operations).

Filter-by-callsite allows hiding known log lines or verbose/noisy logging sites when walking through a log stream. Walking through multiple log streams side-by-side allows finding rank state that is different from the other ranks (i.e., extra lines or missing lines), which typically is a symptom or root cause. This is a direct result of the single-program, multiple-data nature of production training jobs, where every rank iterates on data batches with the same code (with batch-level barriers).
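
The side-by-side comparison can be sketched as a set difference over rank streams: lines that are not common to all ranks surface the divergent rank. This is an illustrative sketch of the debugging primitive, not the UI’s implementation:

```python
def divergent_lines(streams):
    """For a mapping of rank -> list of log lines, return the lines in
    each rank's stream that are not present in every other rank's
    stream. Extra or missing lines often point at the symptomatic rank
    in an SPMD training job. (Illustrative sketch.)"""
    common = set.intersection(*(set(s) for s in streams.values()))
    return {rank: sorted(set(s) - common) for rank, s in streams.items()}

streams = {
    0: ["init", "batch 0 done", "batch 1 done"],
    1: ["init", "batch 0 done", "batch 1 done"],
    2: ["init", "batch 0 done", "NCCL timeout"],  # divergent rank
}
print(divergent_lines(streams)[2])  # -> ['NCCL timeout']
```

Ranks 0 and 1 are also flagged with the extra `batch 1 done` line, which is the mirror image of rank 2 missing it; both views are useful when triaging.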

Figure 2: Logarithm UI features for training systems debugging (logs shown are for demonstration purposes).

Logarithm ingests continuous model telemetry and summary statistics that span model input and output tensors, model properties (e.g., learning rate), model internal state tensors (e.g., neuron activations), and gradients during training. This powers live training-model monitoring dashboards such as an internal deployment of TensorBoard, and is used by ML engineers to debug model convergence issues and training failures (due to gradient/loss explosions) using notebooks on raw telemetry.

Model telemetry tends to be iteration-based tensor timeseries with dimensions (e.g., model architecture, neuron, or module names), and tends to be high-volume and high-throughput (which makes low-cost ingestion in Logarithm a natural fit). Collocating systems and model telemetry enables debugging issues that cascade from one layer to the other. The model telemetry APIs internally write timeseries and dimensions as typed key-value pairs using the Logarithm metadata API. Multimodal data (e.g., images) are captured as references to files written to an external blob store.
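
A hedged sketch of what such a telemetry write might look like: a tensor is reduced to per-iteration summary statistics and emitted as typed key-value pairs. The field names are assumptions for illustration, not Logarithm’s schema:

```python
import math

def tensor_summary(name, iteration, values):
    """Reduce a tensor (here, a flat list of floats) to per-iteration
    summary statistics, shaped as the typed key-value pairs a
    model-telemetry API might write via a metadata API. (Sketch only;
    field names are illustrative assumptions.)"""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {
        "tensor": name,          # dimension: module/neuron name
        "iteration": iteration,  # timeseries key
        "mean": mean,
        "std": math.sqrt(var),
        "min": min(values),
        "max": max(values),
    }

point = tensor_summary("layer1.grad", 120, [0.1, 0.3, 0.2])
# point["mean"] == 0.2
```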

Model telemetry dashboards typically consist of a large number of timeseries visualizations organized in a grid; this allows ML engineers to eyeball spatial and temporal dynamics of the model’s external and internal state over time and find anomalies and correlation structure. A single dashboard hence needs to fetch a significantly large number of timeseries and their tensors. In order to render at interactive latencies, dashboards batch and fan out queries to Logarithm using the streaming API. The streaming API returns results in random order, which allows dashboards to incrementally render all plots in parallel: within 100s of milliseconds to the first set of samples, and within seconds to the full set of points.
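
The streaming behavior can be sketched as a generator that shuffles the result set and yields it in batches, so every plot starts rendering immediately rather than waiting for any one series to complete. Batch size and shuffling details are illustrative assumptions:

```python
import random

def stream_random_order(rows, batch_size=2, seed=0):
    """Yield query results in random order, batch by batch, so a
    dashboard can render all plots incrementally in parallel.
    (Sketch of the streaming API's random-order property.)"""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

batches = list(stream_random_order(range(6)))
# every row appears exactly once, in randomized order across batches
```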

Figure 3: TensorBoard model telemetry dashboard powered by Logarithm. Renders 722 metric timeseries at once (a total of 450k samples).

Logarithm’s system architecture

Our goal behind Logarithm is to build a highly scalable and fault-tolerant system that supports high-throughput ingestion and interactive query latencies, and provides strong guarantees on availability, durability, freshness, completeness, and query latency.

Figure 4: Logarithm’s system architecture.

At a high level, Logarithm comprises the following components:

  1. Application processes emit logs using logging APIs. The APIs support emitting unstructured log lines along with typed metadata key-value pairs (per line).
  2. A host-side agent discovers the format of lines and parses them for common fields, such as timestamp, severity, process ID, and callsite.
  3. The resulting object is buffered and written to a distributed queue (for that log stream) that provides durability guarantees with days of object lifetime.
  4. Ingestion clusters read objects from queues, and support additional parsing based on any user-defined regex extraction rules; the extracted key-value pairs are written to the line’s metadata.
  5. Query clusters support interactive and bulk queries on one or more log streams with predicate filters on log text and metadata.
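
Step 2 can be sketched for glog-formatted lines, whose prefix encodes severity, date, time, thread ID, and callsite. The real agent also auto-discovers line formats, which this sketch omits:

```python
import re

# glog-style line: "I0210 13:45:01.123456  1234 trainer.cc:42] message"
GLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<callsite>[^\]]+)\] (?P<message>.*)$"
)

def parse_glog_line(line):
    """Parse the common fields (severity, timestamp, thread ID,
    callsite) from one glog-formatted line, as a host-side agent
    might. Returns None for lines that don't match. (Sketch only.)"""
    m = GLOG_RE.match(line)
    return m.groupdict() if m else None

fields = parse_glog_line("I0210 13:45:01.123456  1234 trainer.cc:42] step done")
# fields["severity"] == "I", fields["callsite"] == "trainer.cc:42"
```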

Logarithm stores the locality of data blocks in a central locality service. We implement this on a hosted, highly partitioned and replicated collection of MySQL instances. Every block that is generated at the ingestion clusters is written as a set of locality rows (one for each log stream in the block) to a deterministic shard, and reads are distributed across the replicas of a shard. For scalability, we do not use distributed transactions, since the workload is append-only. Note that since the ingestion processing across log streams is not coordinated by design (for scalability), federated queries across log streams may not return the same last-logged timestamps between log streams.
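
The deterministic shard choice can be sketched as a stable hash of the log stream name; since writes are append-only and always land on the same shard, no distributed transaction is needed. The hash function and shard count here are assumptions:

```python
import hashlib

NUM_SHARDS = 1024  # illustrative shard count, not Logarithm's

def locality_shard(log_stream_name, num_shards=NUM_SHARDS):
    """Map a log stream to a deterministic shard for its locality
    rows: the same stream always hashes to the same shard, so
    append-only writes need no cross-shard coordination. (Sketch.)"""
    digest = hashlib.sha256(log_stream_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# the same stream always lands on the same shard
shard = locality_shard("train/job123/stdout")
```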

Our design decisions center on layering of storage, query, and log analytics, and on simplicity in state distribution. We design for two common properties of logs: they are written more than they are queried, and recent logs tend to be queried more than older ones.

Design decisions

Logarithm stores logs as blocks of text and metadata, and maintains secondary indices to support low-latency lookups on text and/or metadata. Since logs rapidly lose query likelihood with time, Logarithm tiers the storage of logs and secondary indices across physical memory, local SSD, and a remote durable and highly available blob storage service (at Meta we use Manifold). In addition to secondary indices, tiering also ensures the lowest latencies for the most-accessed (recent) logs.

Lightweight disaggregated secondary indices. Maintaining secondary indices on disaggregated blob storage magnifies data lookup costs at query time. Logarithm’s secondary indices are designed to be lightweight, using Bloom filters. The Bloom filters are prefetched (or loaded on-query) into a distributed cache on the query clusters when blocks are published on disaggregated storage, to hide network latencies on index lookups. We later added support for caching data blocks in the query cache when executing a query. The system tries to collocate data from the same log stream in order to reduce fan-outs and stragglers during query processing. The logs and metadata are stored as ORC files. The Bloom filters currently index log stream locality and metadata key-value information (i.e., min-max values and Bloom filters for each column of ORC stripes).
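
A toy Bloom-filter index illustrates the pruning idea: each block’s metadata values feed a filter, and at query time, blocks whose filter rules the queried value out are skipped entirely (no false negatives, occasional false positives). The parameters and structure are arbitrary, not Logarithm’s:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for sketching block-level index pruning.
    (Bit size and hash count are arbitrary, not Logarithm's.)"""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array |= 1 << p

    def might_contain(self, item):
        return all((self.array >> p) & 1 for p in self._positions(item))

# Index each block's metadata values; prune blocks the filter rules out.
blocks = {"block-1": ["rank=0", "rank=1"], "block-2": ["rank=7"]}
index = {}
for block, values in blocks.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    index[block] = bf

to_read = [b for b, bf in index.items() if bf.might_contain("rank=7")]
# "block-2" is always read; "block-1" is skipped unless a (rare) false positive
```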

Logarithm separates compute (ingestion and query) and storage in order to rapidly scale out the volume of log blocks and secondary indices. The exception to this is the in-memory memtable on the ingestion clusters, which buffers time-ordered lists of log streams and is a staging area for both writes and reads. The memtable is a bounded per-log-stream buffer covering the most recent, long-enough time window of logs that are expected to be queried. The ingestion implementation is designed to be I/O-bound, and not compute- or host-memory-bandwidth-heavy, in order to handle close to GB/s of per-host ingestion streaming. To minimize memtable contention, we implement multiple memtables: one for staging, and an immutable prior version for serializing to disk. The ingestion implementation follows zero-copy semantics.
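
The multi-memtable scheme can be sketched as an active staging buffer that is swapped with an immutable one being serialized; appends never contend with the flush. This is a simplified illustration of the contention-avoidance idea, not Logarithm’s C++ implementation:

```python
import threading

class Memtable:
    """Two-buffer memtable sketch: the active buffer absorbs appends
    while the sealed (immutable) one is serialized to disk, so writes
    only contend on a brief pointer swap. (Illustrative structure.)"""
    def __init__(self):
        self.active = []       # staging buffer for incoming log lines
        self.immutable = None  # sealed buffer being flushed
        self.lock = threading.Lock()

    def append(self, line):
        with self.lock:
            self.active.append(line)

    def seal(self):
        """Swap buffers; returns the sealed buffer for serialization."""
        with self.lock:
            self.immutable, self.active = self.active, []
            return self.immutable

m = Memtable()
m.append("line 1")
m.append("line 2")
flushing = m.seal()  # ["line 1", "line 2"]; new appends go to a fresh buffer
m.append("line 3")
```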

Similarly, Logarithm separates ingestion and query resources to ensure that bulk processing (ingestion) and interactive workloads do not impact each other. Note that Logarithm’s design uses schema-on-write, but the data model and parsing computation is distributed between the logging hosts (which scales ingestion compute) and, optionally, the ingestion clusters (for user-defined parsing). Customers can add additional anticipated capacity for storage (e.g., increased retention limits), ingestion, and query workloads.

Logarithm pushes down distributed state maintenance to the disaggregated storage layers (instead of replicating compute at the ingestion layer). The disaggregated storage in Manifold uses read-write quorums to provide strong consistency, durability, and availability guarantees. The distributed queues in Scribe use LogDevice for maintaining objects as a durable replicated log. This simplifies fault tolerance in the ingestion and query tiers. Ingestion nodes stream serialized objects on local SSDs to Manifold in 20-minute epochs, and checkpoint Scribe offsets on Manifold. When a failed ingestion node is replaced, the new node downloads the last epoch of data from Manifold, and restarts ingesting raw logs from the last Scribe checkpoint.

Ingestion elasticity. The Logarithm control plane (based on Shard Manager) tracks ingestion-node health and log stream shard-level hotspots, and relocates shards to other nodes when it finds issues or load. When there is an increase in logs written to a log stream, the control plane scales out the shard count and allocates new shards on ingestion nodes with available resources. The system is designed to provide resource isolation between log streams at ingestion time. If there is a significant surge at very short timescales, the distributed queues in Scribe absorb the spikes, but when the queues are full, the log stream can lose logs (until the elasticity mechanisms increase shard counts). Such spikes typically result from logging bugs (e.g., verbosity) in application code.

Query processing. Queries are routed randomly across the query clusters. When a query node receives a request, it assumes the role of an aggregator and partitions the request across a bounded subset of query-cluster nodes (balancing between cluster load and query latency). The aggregator pushes down filter and sort operators to the query nodes and returns sorted results (an end-to-end blocking operation). The query nodes read their partitions of logs by looking up locality, followed by secondary indices and data blocks; the read can span the query cache, ingestion nodes (for the most recent logs), and disaggregated storage. We added 2x replication of the query cache to support query-cluster load distribution and fast failover (without waiting for cache-shard movement). Logarithm also provides a streaming query API with randomized and incremental sampling that returns filtered logs (an end-to-end non-blocking operation) for lower-latency reads and time-to-first-log. Logarithm paginates result sets.
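
The blocking aggregation path can be sketched as filter pushdown to each partition followed by a merge of the per-partition sorted streams. The record shape and function names are illustrative assumptions:

```python
import heapq

def aggregate(partitions, predicate, limit=None):
    """Aggregator sketch: push the filter down to each partition
    (each held by a different query node), sort per partition, then
    merge the sorted streams into one sorted result -- an end-to-end
    blocking operation. (Illustrative, not Logarithm's operators.)"""
    filtered = [sorted(rec for rec in part if predicate(rec))
                for part in partitions]
    merged = list(heapq.merge(*filtered))
    return merged[:limit] if limit else merged

partitions = [
    [(3, "c"), (1, "a")],        # logs held by query node 1
    [(2, "b"), (4, "ERROR d")],  # logs held by query node 2
]
result = aggregate(partitions, lambda rec: True)
# -> [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'ERROR d')]
```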

Logarithm can trade off query-result completeness or ordering to maintain query latency (and flags to the client when it does so). For example, this can be the case when a partition of a query is slow or when the number of blocks to be read is too high. In the former case, it times out and skips the straggler. In the latter scenario, it starts from the skipped blocks (or offsets) when processing the next result page. In practice, we provide guarantees for both result completeness and query latency. This is mainly feasible since the system has mechanisms to reduce the likelihood of the root causes that lead to stragglers. Logarithm also does query admission control at the client or user level.
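
The completeness-versus-latency tradeoff can be sketched as a page executor that skips partitions whose cost would exceed the deadline, flags the incompleteness to the client, and records where the next page should resume. All names, costs, and timings are illustrative assumptions:

```python
def run_page(partitions, deadline_ms, cost_ms):
    """Sketch of trading completeness for latency: partitions whose
    simulated read cost would blow the remaining deadline are
    skipped, flagged, and carried over to the next result page."""
    results, skipped, spent = [], [], 0
    for pid, rows in partitions.items():
        if spent + cost_ms[pid] > deadline_ms:
            skipped.append(pid)  # straggler: resume on the next page
            continue
        spent += cost_ms[pid]
        results.extend(rows)
    return {"rows": results, "complete": not skipped, "resume_from": skipped}

page = run_page({"p1": [1, 2], "p2": [3]},
                deadline_ms=10, cost_ms={"p1": 4, "p2": 20})
# page["complete"] is False and page["resume_from"] == ["p2"]
```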

The following figures represent Logarithm’s aggregate production performance and scalability across all log streams. They highlight scalability as a result of design decisions that keep the system simple (spanning disaggregation, ingestion-query separation, indexes, and fault-tolerance design). We present our production service-level objectives (SLOs) over a month, which are defined as the fraction of time the system violates thresholds on availability, durability (including completeness), freshness, and query latency.

Figure 5: Logarithm’s ingestion-query scalability for the month of January 2024 (one point per day).
Figure 6: Logarithm SLOs for the month of January 2024 (one point per day).

Logarithm supports strong security and privacy guarantees. Access control can be enforced at per-log-line granularity at ingestion and query time. Log streams can have configurable retention windows with line-level deletion operations.

Next steps

Over the past few years, several use cases have been built over the foundational log primitives that Logarithm implements. Systems such as relational algebra on structured data and log analytics are being layered on top with Logarithm’s query-latency guarantees, using pushdowns of search-filter-sort and federated retrieval operations. Logarithm provides a native UI for interactive log exploration, search, and filtering to aid debugging use cases. This UI is embedded as a widget in service consoles across Meta services. Logarithm also provides a CLI for bulk download of service logs for scripting analyses.

The Logarithm design has centered around simplicity for scalability guarantees. We are continuously building domain-specific and domain-agnostic log analytics capabilities within or layered on Logarithm, with appropriate pushdowns for performance optimizations. We continue to invest in storage and query-time improvements, such as lightweight disaggregated inverted indices for text search, storage layouts optimized for queries, and distributed debugging UI primitives for AI systems.


We thank the Logarithm team’s current and past members, notably our leads: Amir Alon, Stavros Harizopoulos, Rukmani Ravisundaram, Laurynas Sukys, and Hanwen Zhang; and our leadership: Vinay Perneti, Shah Rahman, Nikhilesh Reddy, Gautam Shanbhag, Girish Vaitheeswaran, and Yogesh Upadhay. Thanks to our partners and customers: Sergey Anpilov, Jenya (Eugene) Lee, Aravind Ram, Vikram Srivastava, and Mik Vyatskov.