Topics: C, C++, Debugging, Memory, Profiling

Debugging and Profiling C and C++ with Valgrind

Updated May 6, 2025
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
- Brian W. Kernighan

Introduction

Memory bugs are among the most expensive and pain-in-the-ass bugs in existence. A buffer overflow can go unnoticed for months: the program keeps running, produces slightly wrong output, and silently corrupts data structures, until one day it crashes in production in a way that is nearly impossible to reproduce in a debugger. Use-after-free bugs can be exploited to achieve arbitrary code execution. Memory leaks can bring down a long-running server over the course of days or weeks - which is why some services end up being restarted on a schedule just to stay alive.

The problem is that C and C++ give the programmer direct control over memory but provide no runtime enforcement, so we can easily shoot ourselves in the foot. The language standards say that accessing memory out of bounds, reading an uninitialized variable, or dereferencing a freed pointer is undefined behavior - meaning the compiler and runtime are free to do absolutely anything, including appearing to work correctly while silently corrupting state elsewhere. The bug and its visible effect can be arbitrarily far apart in both time and program location.

Valgrind is a framework that addresses this by running your program on a watchful synthetic CPU that observes every memory operation and reports an error the moment the offending access happens: not when the crash eventually occurs, but at the exact instruction that causes it.

How Valgrind Works: Dynamic Binary Instrumentation

When you run valgrind ./my_program, your program is never executed directly by the CPU. Instead, Valgrind's core engine reads the binary, disassembles it into an intermediate representation called VEX IR, hands it to the active tool (e.g., Memcheck) for instrumentation - the tool injects its own analysis code - and then JIT-compiles the instrumented IR back into native machine code, which is what the CPU actually runs. Every basic block goes through this pipeline the first time it executes; subsequent executions use the cached translated block.

This technique is called Dynamic Binary Instrumentation (DBI). The key consequence is that Valgrind operates entirely at the binary level. It does not need your source code, does not require recompilation with special flags (though debug symbols help enormously for readable output), and does not need to be told in advance what to instrument - it sees everything. The downside is cost: the JIT overhead plus the analysis code added by each tool makes programs run roughly 10–50× slower under Valgrind, depending on the tool.

For comparison, compile-time sanitizers like AddressSanitizer (ASan) and MemorySanitizer (MSan) work differently: they instrument the code at compile time and run with only around 2–3x overhead. Valgrind's advantage over them is that it works on any binary, including prebuilt third-party libraries, and does not require source access. Its disadvantage is the much larger overhead and the fact that it cannot instrument stack accesses as precisely as ASan.

Installation and Compilation

On Debian/Ubuntu:

sudo apt install valgrind

On Arch/Manjaro:

sudo pacman -S valgrind

On Windows: Valgrind is not available natively; it targets Linux and other Unix-like systems. On a Windows machine, run it inside WSL2 or a Linux VM.

Compiling for Valgrind

Valgrind works on any binary, but the quality of its output depends heavily on what debugging information the binary contains. Always compile with -g to embed DWARF debug information. This is what lets Valgrind translate a raw instruction address like 0x401156 into a human-readable main (main.cpp:6).

You should also disable or reduce optimizations. With -O2 or -O3, the compiler aggressively reorders, inlines, and eliminates code. The resulting binary may not have a meaningful correspondence to source lines, and some bugs may be hidden because the optimizer happened to compile around them (or made them worse by reordering). Use -O0 for the clearest output:

g++ -g -O0 -o my_program my_program.cpp

If you need to test with optimizations enabled (for example, because a bug only manifests when the optimizer is on), -O1 or -Og is a reasonable compromise - it enables some optimizations while keeping the binary reasonably debuggable.

Memcheck: Finding Memory Errors

Memcheck is the default Valgrind tool and the most widely used. It detects five categories of bugs:

  1. Memory leaks - allocations that are never freed.
  2. Invalid heap reads/writes - accessing memory outside an allocated block.
  3. Use-after-free - accessing memory that has been freed.
  4. Use of uninitialized memory - reading from a variable or allocation that was never written.
  5. Double free - calling free() or delete on the same pointer twice.

Run Memcheck like this:

valgrind \
  --tool=memcheck \
  --leak-check=full \
  --show-leak-kinds=all \
  --track-origins=yes \
  --verbose \
  --error-exitcode=1 \
  ./my_program

Flag-by-flag breakdown:

  • --tool=memcheck - selects Memcheck explicitly (it is the default; included here for clarity).
  • --leak-check=full - after the program exits, report every individual leak with a full allocation stack trace. Without this, Valgrind only prints a summary count.
  • --show-leak-kinds=all - report all four categories of leaks (definite, indirect, possible, and still-reachable). The default is definite,possible, which omits "indirectly lost" and "still reachable" blocks.
  • --track-origins=yes - when an uninitialized-value error fires, show where the uninitialized memory originated (which malloc call or stack frame). This roughly doubles Memcheck's memory overhead but is almost always worth it.
  • --verbose - print extra information, including the list of suppressed errors.
  • --error-exitcode=1 - makes Valgrind exit with a nonzero code if any errors were found, enabling CI integration.

How Memcheck Knows: Shadow Memory

Memcheck's ability to detect all these errors comes from a mechanism called shadow memory. For every byte in your program's address space, Memcheck maintains a corresponding set of bits in a parallel shadow region that tracks two properties:

  • Addressability (A-bits): Is this address currently valid to read or write? One bit per byte of program memory. If the A-bit is 0 ("not addressable"), any access to that byte triggers an "Invalid read/write" error.
  • Validity / Definedness (V-bits): Is the data at this address actually initialized with a meaningful value? One bit per bit of program memory (so 8 V-bits per byte). If you read a byte whose V-bits are "undefined" and then use that value in a way that affects observable behavior - a branch, a system call argument, etc. - Memcheck fires a "Conditional jump or move depends on uninitialised value(s)" error.

When you call malloc(n), Memcheck sets the A-bits for the returned block to "addressable" and the V-bits to "undefined" (because malloc does not zero-initialize). When you call free(p), the A-bits for that block flip back to "not addressable". Any subsequent access is immediately caught, even if the OS has not yet reclaimed or reused that memory (which is why use-after-free bugs so often escape the normal allocator - the OS does not immediately repossess the page).
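
To make these transitions concrete, here is a small annotated sketch (illustrative only - the comments describe the shadow state Memcheck maintains at each step, following the model above):

#include <cstdlib>

int main() {
    char* p = (char*)malloc(4);  // A-bits: addressable; V-bits: undefined
    char c = p[0];               // legal read, but 'c' now carries undefined V-bits
    p[1] = 'x';                  // the write marks p[1]'s V-bits as defined
    if (c == 'x') { }            // ERROR: conditional jump depends on uninitialised value(s)

    free(p);                     // A-bits flip back to "not addressable"
    // p[2] = 'y';               // if uncommented: reported as an invalid write
    return 0;
}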

Memcheck also maintains metadata about every active heap allocation: its base address, size, and the stack trace of where it was allocated. This is how it can report "Address 0x5204040 is 4 bytes after a block of size 8 alloc'd at ..." for out-of-bounds accesses.

Memory Leaks

A memory leak is an allocation whose memory can never be returned to the allocator because the program has lost all pointers to it. On long-running processes this causes memory usage to grow unboundedly. On short-lived programs the OS will reclaim everything on exit, so leaks are not always fatal - but they indicate a logic error and in larger programs they compound.

At program exit, Memcheck scans all of accessible memory for pointers to heap blocks. Based on what it finds, it classifies each outstanding allocation into one of four categories:

  • Definitely lost: No pointer to the start of the block exists anywhere in memory. The block is unreachable - you could not free it even if you wanted to. This is the clearest indicator of a real bug.
  • Indirectly lost: The block is only reachable through another block that is itself "definitely lost." Consider a linked list whose head pointer is leaked: the head node is "definitely lost," and every node reachable from the head is "indirectly lost." Fixing the definitely-lost block will also fix the indirectly-lost ones.
  • Possibly lost: An interior pointer exists - some pointer points into the middle of the block rather than at its start. This can happen with certain custom allocators or pointer arithmetic. It may be legitimate or it may indicate the pointer was accidentally advanced past the start of the block.
  • Still reachable: A pointer to the block exists at exit, but the block was never freed. This is common and often benign: C++ standard library implementations frequently allocate global state (string caches, locale objects, thread-local storage) that they intentionally never free, relying on the OS to reclaim memory at process exit. In your own code, static or global containers fall into this category. These are usually not real bugs, which is why --show-leak-kinds=all is needed to see them.
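
The difference between "definitely lost" and "still reachable" is easiest to see side by side. In this illustrative sketch, the block held only by a local pointer should be reported as definitely lost, while the block held by a global pointer is merely still reachable:

#include <cstdlib>

int* g_buffer = nullptr;   // a global pointer is scanned at exit: its block stays reachable

int main() {
    int* lost = (int*)malloc(sizeof(int));   // only handle is a local; unreachable after main returns
    g_buffer  = (int*)malloc(sizeof(int));   // never freed, but still reachable through g_buffer

    (void)lost;
    return 0;
    // Expected classification: the first block "definitely lost",
    // the second block "still reachable".
}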

Here is a minimal program with a definite leak:

#include <cstdlib>
#include <cstdio>

int main() {
    // Allocate an array of 10 ints on the heap.
    int* arr = (int*)malloc(10 * sizeof(int));
    arr[0] = 42;

    // We return without calling free(arr).
    // Once main() returns, the pointer 'arr' goes out of scope.
    // The 40 bytes are now unreachable.
    return 0;
}

Compile and run under Valgrind:

g++ -g -O0 -o leak leak.cpp
valgrind --leak-check=full --show-leak-kinds=all ./leak

Output (abbreviated):

==18423== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==18423==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==18423==    by 0x401176: main (leak.cpp:6)
==18423==
==18423== LEAK SUMMARY:
==18423==    definitely lost: 40 bytes in 1 blocks
==18423==    indirectly lost: 0 bytes in 0 blocks
==18423==      possibly lost: 0 bytes in 0 blocks
==18423==    still reachable: 0 bytes in 0 blocks
==18423==         suppressed: 0 bytes in 0 blocks

The number in ==18423== is the process ID. Valgrind tells us exactly where the leaked allocation originated: line 6 of leak.cpp, the malloc call. With larger programs and deeper call stacks, this stack trace is what makes finding leaks tractable.

Indirect Leaks: An Example

To see indirect leaks, consider a simple singly-linked list where we forget to free the head:

#include <cstdlib>

struct Node {
    int val;
    Node* next;
};

int main() {
    // Build a list: 1 -> 2 -> 3 -> nullptr
    Node* head = (Node*)malloc(sizeof(Node));
    head->val  = 1;
    head->next = (Node*)malloc(sizeof(Node));
    head->next->val  = 2;
    head->next->next = (Node*)malloc(sizeof(Node));
    head->next->next->val  = 3;
    head->next->next->next = nullptr;

    // We never free any of the nodes.
    // 'head' goes out of scope on return.
    return 0;
}

Valgrind reports:

==18424== 16 bytes in 1 blocks are definitely lost in loss record 1 of 3
==18424==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==18424==    by 0x401188: main (list.cpp:10)    <-- the head node
==18424==
==18424== 32 bytes in 2 blocks are indirectly lost in loss record 2 of 3
==18424==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==18424==    by 0x4011B4: main (list.cpp:12)    <-- node 2
==18424==    ...
==18424==    by 0x4011D8: main (list.cpp:14)    <-- node 3

Only the head is "definitely lost"; the other two nodes are "indirectly lost" because they are only reachable through the leaked head. This classification is useful: you should focus on fixing the "definitely lost" allocations first. The indirect ones typically disappear automatically.
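
The fix therefore only needs to target the head: walking the list and freeing every node makes both the definite and the indirect reports disappear. One possible cleanup helper for the listing above (call it as free_list(head) before returning) is:

// Frees every node of the singly-linked list from the example above.
void free_list(Node* head) {
    while (head != nullptr) {
        Node* next = head->next;  // save the link before freeing the node
        free(head);
        head = next;
    }
}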

Invalid Heap Reads, Invalid Writes, and Use-After-Free

These three error types are arguably more dangerous than leaks because they cause undefined behavior with immediate consequences: silent data corruption, incorrect program output, or crashes that appear in completely unrelated code far from the actual bug site.

Heap Buffer Overflow

A heap buffer overflow occurs when you access memory beyond the end of an allocated block. The adjacent memory might belong to another allocation's header, another live object, or be unallocated. Writing to it silently corrupts state; reading from it produces garbage.

#include <cstdlib>
#include <cstdio>

int main() {
    int* buf = (int*)malloc(5 * sizeof(int));  // valid indices: 0..4

    for (int i = 0; i <= 5; i++) {  // BUG: i == 5 is one past the end
        buf[i] = i * 10;
    }

    printf("%d\n", buf[0]);
    free(buf);
    return 0;
}

Valgrind catches the out-of-bounds write at i == 5:

==19001== Invalid write of size 4
==19001==    at 0x401185: main (overflow.cpp:7)
==19001==  Address 0x5204054 is 0 bytes after a block of size 20 alloc'd
==19001==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==19001==    by 0x401166: main (overflow.cpp:4)

Valgrind tells us: the invalid write is at overflow.cpp:7 (the buf[i] = i * 10 line when i == 5), the address is 0 bytes after the end of the block (so exactly one element off the end), and the block was allocated at overflow.cpp:4.

Note that Valgrind detects this on the loop iteration that causes the violation, not when the program crashes or produces wrong output. Without Valgrind, this bug might silently corrupt the allocator's internal bookkeeping, causing a crash inside free(buf) far from the actual cause.

Use-After-Free

Use-after-free (UAF) occurs when a pointer is used after the memory it points to has been freed. This is a particularly dangerous class of bug in long-running programs and a well-known attack vector: if an attacker can control what gets allocated in the freed region before the stale pointer is used, they can potentially hijack the program's behavior.

#include <cstdlib>
#include <cstdio>

int main() {
    int* p = (int*)malloc(sizeof(int));
    *p = 42;

    free(p);       // p is now dangling

    *p = 99;       // BUG: write to freed memory
    printf("%d\n", *p);  // BUG: read from freed memory

    return 0;
}

Valgrind reports:

==19002== Invalid write of size 4
==19002==    at 0x401185: main (uaf.cpp:9)
==19002==  Address 0x5204040 is 0 bytes inside a block of size 4 free'd
==19002==    at 0x484BB6F: free (vg_replace_malloc.c:872)
==19002==    by 0x401180: main (uaf.cpp:7)
==19002==  Block was alloc'd at
==19002==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==19002==    by 0x401166: main (uaf.cpp:4)
==19002==
==19002== Invalid read of size 4
==19002==    at 0x40119A: main (uaf.cpp:10)
==19002==  Address 0x5204040 is 0 bytes inside a block of size 4 free'd
  ...

The report includes three stack traces: where the invalid access happened (line 9), where the memory was freed (line 7), and where it was originally allocated (line 4). This three-way linkage is essential for diagnosing UAF bugs in large codebases where allocation, free, and use can span multiple files and hundreds of lines.

Importantly, Valgrind catches this error even though the OS has not yet reclaimed the page. In practice, a UAF bug might "work" under the normal allocator because the freed block has been put back into the allocator's free list but the page is still mapped and writable. Valgrind's A-bit mechanism fires regardless of OS state.
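
To see why such a bug can appear to "work" without Valgrind, consider the following sketch. After the free, a second allocation of the same size will often - though this is allocator-dependent and by no means guaranteed - reuse the same block, so the stale pointer silently observes the new object instead of crashing. Memcheck reports the invalid read either way:

#include <cstdio>
#include <cstdlib>

int main() {
    int* a = (int*)malloc(sizeof(int));
    *a = 1;
    free(a);                               // 'a' is now dangling

    int* b = (int*)malloc(sizeof(int));    // may recycle the block 'a' pointed to
    *b = 2;

    printf("%d\n", *a);  // BUG: under the plain allocator this often prints 2;
                         // under Memcheck it is reported as an invalid read of freed memory
    free(b);
    return 0;
}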

Use of Uninitialized Memory

An uninitialized read is a subtler class of bug. The C standard does not define what value an uninitialized variable holds; in practice it is whatever happened to be in that register or stack slot at the time. Reading it doesn't immediately crash anything - the danger is that the garbage value propagates through computations and eventually drives a branch or system call, producing wrong results that are hard to trace back.

#include <cstdlib>
#include <cstdio>

int main() {
  int* arr = (int*)malloc(5 * sizeof(int));
  // We initialize arr[0..2] but forget arr[3] and arr[4].
  arr[0] = 10; arr[1] = 20; arr[2] = 30;

  int sum = 0;
  for (int i = 0; i < 5; i++) {
      sum += arr[i];  // arr[3] and arr[4] are uninitialized
  }

  // 'sum' is now tainted by undefined values.
  if (sum > 100) {  // BUG: branch depends on uninitialized data
      printf("large\n");
  }

  free(arr);
  return 0;
}

Valgrind reports:

==19003== Conditional jump or move depends on uninitialised value(s)
==19003==    at 0x4011B2: main (uninit.cpp:15)
==19003==  Uninitialised value was created by a heap allocation
==19003==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==19003==    by 0x401166: main (uninit.cpp:4)

Several things are worth noting here. First, Memcheck does not fire when we read arr[3] into sum. It fires when we use sum in a branch on line 15. This is intentional: Memcheck propagates "taint" (undefined V-bits) lazily through arithmetic. Only when undefined bits reach a decision point that affects control flow or a syscall argument does Valgrind report the error. This means the reported line may not be the root cause - you need to trace the data flow back to the malloc on line 4.

Second, with --track-origins=yes, Valgrind tells us the uninitialized value "was created by a heap allocation at uninit.cpp:4." Without this flag, Valgrind can only tell us that a conditional depends on an uninitialized value; with it, we get the origin of the uninitialized memory.

Note: calloc() zero-initializes its memory, so V-bits start out "defined." malloc() does not. Stack variables in C are similarly uninitialized; Memcheck tracks those too, though it cannot always pinpoint stack allocations with the same precision as heap ones.
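
The difference is easy to demonstrate. In this small sketch, only the branch on the malloc'd value should be reported; the calloc'd value starts out as a well-defined zero:

#include <cstdlib>

int main() {
    int* a = (int*)malloc(sizeof(int));     // contents undefined
    int* b = (int*)calloc(1, sizeof(int));  // zero-initialized: V-bits defined

    if (*a > 0) { }   // reported: conditional jump depends on uninitialised value(s)
    if (*b > 0) { }   // silent: *b is a well-defined zero

    free(a);
    free(b);
    return 0;
}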

Double Free and Mismatched Deallocation

A double free occurs when free() is called on the same pointer twice. After the first free, the block is returned to the allocator's free list. The second free corrupts the allocator's internal state. In glibc, this often results in an abort with "double free or corruption," but that detection is not guaranteed and can be bypassed.

#include <cstdlib>
  
int main() {
  int* p = (int*)malloc(sizeof(int));
  *p = 1;
  free(p);
  free(p);  // BUG: double free
  return 0;
}

Valgrind reports:

==19004== Invalid free() / delete / delete[] / realloc()
==19004==    at 0x484BB6F: free (vg_replace_malloc.c:872)
==19004==    by 0x401180: main (dfree.cpp:7)    <-- second free
==19004==  Address 0x5204040 is 0 bytes inside a block of size 4 free'd
==19004==    at 0x484BB6F: free (vg_replace_malloc.c:872)
==19004==    by 0x40117A: main (dfree.cpp:6)    <-- first free
==19004==  Block was alloc'd at
==19004==    at 0x484A840: malloc (vg_replace_malloc.c:442)
==19004==    by 0x401166: main (dfree.cpp:4)    <-- original malloc

Memcheck also catches mismatched allocator/deallocator pairs - for example, allocating with new and freeing with free(), or allocating an array with new[] and deleting it with a scalar delete. These are undefined behavior in C++ even if they "work" on a given implementation.

int* p = new int[10];
delete p;   // BUG: should be delete[]

Valgrind reports:

==19005== Mismatched free() / delete / delete []
==19005==    at 0x484C5C4: operator delete(void*, unsigned long) ...
==19005==    by 0x401181: main (mismatch.cpp:3)
==19005==  Address 0x5204040 is 0 bytes inside a block of size 40 alloc'd
==19005==    at 0x484D1F2: operator new[](unsigned long) ...
==19005==    by 0x401170: main (mismatch.cpp:2)

Callgrind: Call-Graph and Cache Profiling

Once a program is correct, the next question is often: where is it slow? Callgrind collects detailed execution statistics - instruction counts, cache simulation results, and a full call graph - without any changes to the binary.

Run Callgrind like this:

valgrind --tool=callgrind \
  --cache-sim=yes \
  --branch-sim=yes \
  ./my_program
# Produces: callgrind.out.<pid>

The output file contains raw event counts for every function and every call edge in the program. The events Callgrind reports include:

  • Ir (Instruction reads): the number of instructions executed. This is the primary measure of "work done" in the CPU.
  • Dr / Dw (Data reads / writes): memory load and store operations.
  • I1mr / D1mr / D1mw: L1 instruction-cache misses, L1 data-cache read misses, L1 data-cache write misses.
  • ILmr / DLmr / DLmw: last-level cache (LLC) misses. LLC misses are the expensive ones - they go to DRAM.
  • Bcm / Bim (with --branch-sim=yes): branch condition mispredictions and indirect branch mispredictions.

Callgrind does not measure wall-clock time. It measures instruction counts, which are a proxy for CPU time for compute-bound code. For I/O-bound or synchronization-heavy code, instruction counts are less meaningful - use perf or gprof with real-time sampling instead.
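
As a concrete target for --cache-sim=yes, here is a classic sketch: summing a matrix column-by-column strides through memory and should produce far more L1 data-cache read misses (D1mr) than the row-by-row version, even though both execute roughly the same number of instructions (Ir). The matrix size is arbitrary; exact counts will vary by machine and compiler.

#include <cstdio>
#include <vector>

constexpr int N = 2048;

long sum_rows(const std::vector<int>& m) {   // sequential access: cache friendly
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

long sum_cols(const std::vector<int>& m) {   // stride-N access: many D1mr misses
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}

int main() {
    std::vector<int> m(N * N, 1);
    printf("%ld %ld\n", sum_rows(m), sum_cols(m));
    return 0;
}

Running valgrind --tool=callgrind --cache-sim=yes on this and viewing the result with callgrind_annotate or KCacheGrind (covered below) should show sum_cols dominating the D1mr column while the two Ir counts stay close.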

Visualizing with KCacheGrind

The raw output file is not human-readable. The standard tool for visualizing Callgrind output is KCacheGrind (Linux, part of KDE) or QCacheGrind (cross-platform). Install it with:

sudo apt install kcachegrind   # Debian/Ubuntu
sudo pacman -S kcachegrind     # Arch
kcachegrind callgrind.out.12345

KCacheGrind shows a flat list of functions sorted by cost. Each function has two cost columns:

  • Self cost: instructions (or cache misses) spent in the function itself, excluding time spent in callees.
  • Inclusive cost: total instructions in the function and all of its callees. A function with high inclusive cost but low self cost is a hotspot's caller, not the hotspot itself.

KCacheGrind also shows the call graph: which functions call which, and how many instructions were executed on each call edge. This is invaluable for understanding whether a function is expensive because it is called millions of times with cheap bodies, or because individual calls are expensive.
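
As an illustration of the self/inclusive distinction, consider this small sketch (function names are made up): run_simulation should show high inclusive cost but almost no self cost, because all the work happens in its callee dot.

#include <cstdio>
#include <vector>

// High self cost: the actual arithmetic happens here.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
    return s;
}

// Low self cost, high inclusive cost: just calls dot() many times.
double run_simulation(const std::vector<double>& a, const std::vector<double>& b) {
    double total = 0.0;
    for (int step = 0; step < 1000; step++) total += dot(a, b);
    return total;
}

int main() {
    std::vector<double> a(10000, 1.0), b(10000, 2.0);
    printf("%f\n", run_simulation(a, b));
    return 0;
}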

Command-Line Annotation

If you prefer staying in the terminal, callgrind_annotate produces a source-annotated listing:

callgrind_annotate --auto=yes callgrind.out.12345

Each source line is annotated with the instruction count it contributes. This makes it easy to identify the exact loop body or expression that is consuming most of the instructions.

Profiling a Specific Region

Sometimes you only want to profile a specific function or section of code, not the entire program. Callgrind supports starting and stopping collection under program control via the CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION macros (from <valgrind/callgrind.h>), or interactively via callgrind_control:

# Start Valgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no ./my_program

# In another terminal, turn on instrumentation for PID 12345
callgrind_control -i on 12345

# Later, turn it off
callgrind_control -i off 12345
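
In code, the client-request macros can be used like this (a minimal sketch; the macros come from <valgrind/callgrind.h>, which the valgrind package installs, and they are harmless when the program runs outside Valgrind):

#include <valgrind/callgrind.h>
#include <vector>

// Setup noise we want to exclude from the profile.
static std::vector<int> setup() { return std::vector<int>(1 << 20, 1); }

// The region we actually want to profile.
static long expensive_phase(const std::vector<int>& v) {
    long s = 0;
    for (int x : v) s += x;
    return s;
}

int main() {
    auto v = setup();

    CALLGRIND_START_INSTRUMENTATION;   // pairs with --instr-atstart=no
    CALLGRIND_TOGGLE_COLLECT;          // begin collecting event counts
    long s = expensive_phase(v);
    CALLGRIND_TOGGLE_COLLECT;          // stop collecting
    CALLGRIND_STOP_INSTRUMENTATION;

    return (int)(s & 1);
}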

Massif: Heap Memory Profiling

Memcheck tells you where memory is leaked; Massif tells you where memory is used. It answers the question: at the moment my program's heap peaked, what allocations were responsible, and what call chains produced them?

valgrind --tool=massif \
  --pages-as-heap=no \
  --time-unit=ms \
  ./my_program
  # Produces: massif.out.<pid>

Key options:

  • --time-unit=ms - use wall-clock milliseconds as the x-axis in the profile. Alternatives are i (instructions, the default) and B (bytes allocated). Wall-clock time is the most intuitive for understanding heap growth over real time.
  • --pages-as-heap=yes - profile all memory the program touches, not just malloc/new. This includes the text segment, stack, and mmapped files. Useful when hunting for total-process RSS growth rather than just heap growth.
  • --detailed-freq=N - take a "detailed" snapshot (with full allocation trees) every N snapshots. The default is 10.
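
If you want something to experiment on, a toy program like the following (purely illustrative) produces a recognizable profile: a large short-lived allocation creates the peak, and a smaller long-lived one remains until exit.

#include <vector>

int main() {
    // Phase 1: large allocation - the heap climbs to its peak here.
    auto* matrix = new std::vector<double>(500'000, 1.0);

    // Phase 2: a smaller, longer-lived allocation.
    std::vector<char> buffer(200'000, 'x');

    // Phase 3: release the matrix - heap usage drops; 'buffer' lives until exit.
    delete matrix;

    return buffer.empty() ? 1 : 0;
}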

Reading ms_print Output

ms_print massif.out.12345 | less

The output has two parts. The first is an ASCII graph of heap usage over time, with a @ marker at each detailed snapshot and a : at normal snapshots. The peak usage is marked with a #. For example:

 MB
  4.00 ^                                              #
       |                                           @@#@
       |                                        @@@@@#@
       |                                    @@@@@@@@#@@
       |                              @@@@@@@@@@@@#@@@
  0.00 +--------------------------------------------->ms
       0                                            500

The second part lists each detailed snapshot. For each one, it shows the call tree of all live allocations - which call path allocated how many bytes:

--------------------------------------------------------------------------------
  n        time(ms)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  10         120.0          2,097,152      2,097,152             0              0
98.97% (2,097,152B) (heap allocation functions) malloc/new/new[]
->79.46% (1,666,048B) 0x401223: build_matrix (matrix.cpp:45)
| ->79.46% (1,666,048B) 0x401155: main (main.cpp:12)
|
->19.51% (409,600B) 0x4013AB: load_data (loader.cpp:88)
  ->19.51% (409,600B) 0x401155: main (main.cpp:8)

This tells us that at the peak snapshot, 79% of heap memory was allocated by build_matrix() (called from main on line 12) and 19% by load_data(). If we need to reduce peak memory usage, build_matrix is the place to start.

Massif is particularly useful when investigating reports of a program consuming "too much memory." RSS (resident set size, as reported by top) is an unreliable proxy - it includes shared libraries, the stack, and can be inflated by fragmentation. Massif gives you the exact heap bytes your code is responsible for, attributed to specific call sites.

Appendix

Suppression Files

Real programs link against third-party libraries - OpenSSL, glibc, Qt, whatever - whose internals Valgrind instruments as well. Many of these libraries have intentional behaviors that look like errors to Memcheck: they allocate global state they never free, use memory in non-standard ways, or contain benign races that Helgrind flags anyway. These produce false positives that drown out real errors.

A suppression file lists patterns of error reports to ignore. Valgrind ships with suppression files for common system libraries; you can specify additional ones:

valgrind --suppressions=/usr/lib/valgrind/debian.supp \
         --suppressions=./my_project.supp \
         ./my_program

To generate a suppression entry for an error you want to silence, run Valgrind with --gen-suppressions=all. It will print a suppression block after each error that you can copy into your .supp file. A suppression entry looks like:

{
   ignore_openssl_init_leak
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   ...
   fun:OPENSSL_init_ssl
}

Valgrind vs. AddressSanitizer

Valgrind and AddressSanitizer (ASan) both detect many of the same bugs but with different trade-offs:

Property                     Valgrind / Memcheck      AddressSanitizer
Requires recompilation       No                       Yes (-fsanitize=address)
Runtime overhead             ~10–50x                  ~2–3x
Stack buffer overflows       Limited                  Yes (with -O1 or higher)
Heap buffer overflows        Yes                      Yes
Use-after-free               Yes                      Yes
Memory leaks                 Yes (detailed)           Yes (with LeakSanitizer)
Uninitialized reads          Yes                      No (use MemorySanitizer instead)
Works on prebuilt binaries   Yes                      No

In practice, use ASan for your development build (it is fast enough to leave on by default) and Valgrind for auditing third-party libraries or when you need to find uninitialized-read bugs without access to source.

Quick Reference

Tool         Purpose                                   Key flags
Memcheck     Memory errors and leaks                   --leak-check=full --track-origins=yes
Callgrind    Call-graph and cache profiling            --cache-sim=yes --branch-sim=yes
Massif       Heap memory usage over time               --time-unit=ms --pages-as-heap=no
Helgrind     POSIX thread race conditions              (no special flags needed)
DRD          Thread error detection                    --check-stack-var=yes
Cachegrind   Cache and branch-prediction simulation    --branch-sim=yes

