At Codemancers, we're building Rbkit, a fresh code profiler for the Ruby language with tonnes of cool features. I'm currently working on implementing a CPU profiler inside rbkit gem which would help rbkit UI to reconstruct the call graph of the profiled ruby process and draw useful visualizations on the screen. I learned a bunch of new things along the way and I'd love to share them with you in this series of blog posts.
We're going to start from the fundamentals and step by step, we're going to write a rudimentary CPU profiler for Ruby ourselves! By the time we finish we would learn :
- What's CPU profiling
- Profiling modes - Instrumentation and Sampling
- CPU time and Wall time - what they mean and how to measure them
- Writing a simple C extension and using it in Ruby land
- Ruby tracepoints -
call
andreturn
- Signal handling in C
- Pausing Ruby process using a signal and peeking at the call stack
- Some useful and silly experiments using profiling data
Part I. An Introduction to CPU Profiling
By doing a CPU profile of your program, you can find out how expensive your program is with respect to CPU usage. In order to profile your program, you'll need to use a profiling tool and follow the following steps :
- Start CPU profiling
- Execute the code you want to profile
- Stop CPU profiling and get profiling result
- Analyze result
By analyzing the profiling result, you can find the bottlenecks which slow down your whole program.
Profiling modes
CPU profiling is done in broadly 2 ways:
1. Instrumentation
In this mode, the profiling tool makes use of some hooks, either provided by the interpreter or inserted into the program, to understand the call graph and measure the execution time of each method in the call graph.
For example, consider the following piece of Ruby code :
I've inserted some comments to help understand how the hooks would get executed if the Ruby interpreter gave us method call and return hooks:
Now if we could print the current time and the name of the current method inside these hooks, we'd get an output which looks somewhat like this :
sec:00 usec:201007 called main
sec:00 usec:201108 called find_many_square_roots
sec:00 usec:692123 returned find_many_square_roots
sec:00 usec:692178 called find_many_squares
sec:00 usec:846540 returned find_many_squares
sec:00 usec:846594 called find_many_square_roots
sec:01 usec:336166 returned find_many_square_roots
sec:01 usec:336215 called find_many_squares
sec:01 usec:484880 returned find_many_squares
sec:01 usec:484945 called find_many_square_roots
sec:01 usec:959254 returned find_many_square_roots
sec:01 usec:959315 called find_many_squares
sec:02 usec:106474 returned find_many_squares
sec:02 usec:106526 returned main
As you can see, this output can tell us how much time was spent inside each method. It also tells us how many times each method was called. This is roughly how instrumentation profiling works.
Pros :
- High accuracy
- We get method invocation count
- Easy to implement
Cons :
- Extra overhead of executing hooks for each and every method that's profiled
2. Sampling
In this mode of profiling, the profiler interrupts the program execution once in every x unit of time, and takes a peek into the call stack and records what it sees(called a "sample"). Once the program finishes running, the profiler collects all the samples and finds out the number of times each method appears across all the samples.
Hard to visualize? Let's look at the same example code and see how different the output would be, if we used a sampling profiler.
Output from a sampling profiler would look like this :
Call stack at 0.5sec: main/find_many_square_roots
Call stack at 1.0sec: main/find_many_square_roots
Call stack at 1.5sec: main/find_many_square_roots
Call stack at 2.0sec: main/find_many_squares
In this example, the process was interrupted every 0.5 second and the call stack
was recorded. Thus we got 4 samples over the lifetime of the program and out of
those 4 samples, find_many_square_roots
is present in 3 and find_many_squares
is present in only one sample. From this sampling, we say that find_many_square_roots
took 75% of the CPU where as find_many_squares
took only 25%. That's roughly
how sampling profilers work.
Pros :
- Negligible overhead compared to instrumentation profiling
- Easy to find slow/long running methods
Cons :
- Not good at measuring short running methods
- We don't get method invocation count
- Not easy to write a sampling profiler on your own
Recap
We just looked into what CPU profiling means and the 2 common strategies of CPU profiling. In part 2, we'll explore the 2 units of measuring CPU usage - CPU time and Wall time. We'll also get our hands dirty and write some code to get these measurements. Thanks for reading!