diff --git a/lectures/need_for_speed.md b/lectures/need_for_speed.md index ef4f23ad..65ad6f2c 100644 --- a/lectures/need_for_speed.md +++ b/lectures/need_for_speed.md @@ -27,7 +27,7 @@ premature optimization is the root of all evil." -- Donald Knuth ## Overview -It's probably safe to say that Python is the most popular language for scientific computing. +Python is the most popular language for many aspects of scientific computing. This is due to @@ -74,29 +74,30 @@ Let's briefly review Python's scientific libraries. ### Why do we need them? -One reason we use scientific libraries is because they implement routines we want to use. +We need Python's scientific libraries for two reasons: -* numerical integration, interpolation, linear algebra, root finding, etc. +1. Python is small +2. Python is slow + +**Python is small** -For example, it's usually better to use an existing routine for root finding than to write a new one from scratch. +Core Python is small by design -- this helps with optimization, accessibility, and maintenance. -(For standard algorithms, efficiency is maximized if the community can -coordinate on a common set of implementations, written by experts and tuned by -users to be as fast and robust as possible!) +Scientific libraries provide the routines we don't want to -- and probably shouldn't -- write ourselves. -But this is not the only reason that we use Python's scientific libraries. +* numerical integration, interpolation, linear algebra, root finding, etc. -Another is that pure Python is not fast. +**Python is slow** -So we need libraries that are designed to accelerate execution of Python code. +Another reason we need the scientific libraries is that pure Python is relatively slow. -They do this using two strategies: +Scientific libraries accelerate execution using three main strategies: -1. using compilers that convert Python-like statements into fast machine code for individual threads of logic and -2. 
parallelizing tasks across multiple "workers" (e.g., CPUs, individual threads inside GPUs). +1. Vectorization: providing compiled machine code and interfaces that make this code accessible +2. JIT compilation: compilers that convert Python-like statements into fast machine code at runtime +3. Parallelization: shifting tasks across multiple threads / CPUs / GPUs / TPUs -We will discuss these ideas extensively in this and the remaining lectures from -this series. +We will discuss these ideas in depth below. ### Python's Scientific Ecosystem @@ -123,7 +124,7 @@ Here's how they fit together: * Pandas provides types and functions for manipulating data. * Numba provides a just-in-time compiler that plays well with NumPy and helps accelerate Python code. -We will discuss all of these libraries extensively in this lecture series. +We will discuss all of these libraries at length in this lecture series. ## Pure Python is slow @@ -189,15 +190,13 @@ a, b = ['foo'], ['bar'] a + b ``` -(We say that the operator `+` is *overloaded* --- its action depends on the -type of the objects on which it acts) -As a result, when executing `a + b`, Python must first check the type of the objects and then call the correct operation. +As a result, when executing `a + b`, Python must first check the type of the +objects and then call the correct operation. -This involves a nontrivial overhead. +This involves overhead. -If we repeatedly execute this expression in a tight loop, the nontrivial -overhead becomes a large overhead. +If we repeatedly execute this expression in a tight loop, the overhead becomes large. #### Static types @@ -243,38 +242,29 @@ To illustrate, let's consider the problem of summing some data --- say, a collec #### Summing with Compiled Code -In C or Fortran, these integers would typically be stored in an array, which -is a simple data structure for storing homogeneous data. 
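The runtime type check behind `a + b` can be sketched in a few lines of pure Python. The helper below is hypothetical -- it only imitates the interpreter's dispatch step, inspecting the operands' type and then selecting the matching `__add__` implementation:

```python
# A sketch of the dispatch step behind `a + b`: before any work is done,
# the interpreter inspects the type of the operands and looks up the
# matching __add__ implementation.  This lookup happens on every call.
def slow_add(a, b):
    add = type(a).__add__   # runtime type inspection and method lookup
    return add(a, b)

print(slow_add(1, 2))              # integer addition
print(slow_add('foo', 'bar'))      # string concatenation
print(slow_add(['foo'], ['bar']))  # list concatenation
```

In compiled languages this lookup is resolved once, before the program runs, which is one reason tight loops over typed data are so much faster.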
+In C or Fortran, an array of integers is stored in a single contiguous block of memory. -Such an array is stored in a single contiguous block of memory - -* In modern computers, memory addresses are allocated to each byte (one byte = 8 bits). * For example, a 64 bit integer is stored in 8 bytes of memory. * An array of $n$ such integers occupies $8n$ *consecutive* memory slots. -Moreover, the compiler is made aware of the data type by the programmer. - -* In this case 64 bit integers +Moreover, the data type is known at compile time. Hence, each successive data point can be accessed by shifting forward in memory space by a known and fixed amount. -* In this case 8 bytes #### Summing in Pure Python Python tries to replicate these ideas to some degree. -For example, in the standard Python implementation (CPython), list elements are placed in memory locations that are in a sense contiguous. +For example, in the standard Python implementation (CPython), list elements are +placed in memory locations that are in a sense contiguous. However, these list elements are more like pointers to data rather than actual data. Hence, there is still overhead involved in accessing the data values themselves. -This is a considerable drag on speed. - -In fact, it's generally true that memory traffic is a major culprit when it comes to slow execution. - +Such overhead is a major culprit when it comes to slow execution. ### Summary @@ -295,15 +285,11 @@ synonymous with parallelization. This task is best left to specialized compilers! -Certain Python libraries have outstanding capabilities for parallelizing scientific code -- we'll discuss this more as we go along. - - ## Accelerating Python -In this section we look at three related techniques for accelerating Python -code. +In this section we look at three related techniques for accelerating Python code. Here we'll focus on the fundamental ideas. @@ -325,10 +311,11 @@ Many economists usually refer to array programming as "vectorization." 
In computer science, this term has [a slightly different meaning](https://en.wikipedia.org/wiki/Automatic_vectorization). ``` -The key idea is to send array processing operations in batch to pre-compiled -and efficient native machine code. +The key idea is to send array processing operations in batch to pre-compiled and +efficient native machine code. -The machine code itself is typically compiled from carefully optimized C or Fortran. +The machine code itself is typically compiled from carefully optimized C or +Fortran. For example, when working in a high level language, the operation of inverting a large matrix can be subcontracted to efficient machine code that is pre-compiled @@ -346,6 +333,7 @@ The idea of vectorization dates back to MATLAB, which uses vectorization extensi ```{figure} /_static/lecture_specific/need_for_speed/matlab.png ``` +NumPy uses a similar model, inspired by MATLAB. ### Vectorization vs for pure Python loops @@ -423,19 +411,17 @@ can be run) has slowed dramatically in recent years. Chip designers and computer programmers have responded to the slowdown by seeking a different path to fast execution: parallelization. -Hardware makers have increased the number of cores (physical CPUs) embedded in each machine. +This involves -For programmers, the challenge has been to exploit these multiple CPUs by -running many processes in parallel (i.e., simultaneously). +1. increasing the number of CPUs embedded in each machine +1. connecting hardware accelerators such as GPUs and TPUs -This is particularly important in scientific programming, which requires handling - -* large amounts of data and -* CPU intensive simulations and other calculations. +For programmers, the challenge has been to exploit this hardware by +running many processes in parallel. Below we discuss parallelization for scientific computing, with a focus on -1. the best tools for parallelization in Python and +1. tools for parallelization in Python and 1. 
how these tools can be applied to quantitative economic problems. @@ -447,22 +433,18 @@ scientific computing and discuss their pros and cons. #### Multiprocessing -Multiprocessing means concurrent execution of multiple processes using more than one processor. - -In this context, a **process** is a chain of instructions (i.e., a program). +Multiprocessing means concurrent execution of multiple processes (independent programs) using more than one processor. Multiprocessing can be carried out on one machine with multiple CPUs or on a -collection of machines connected by a network. +cluster of machines connected by a network. -In the latter case, the collection of machines is usually called a -**cluster**. +With multiprocessing, *each process has its own memory space*, although the physical memory chip might be shared. -With multiprocessing, each process has its own memory space, although the -physical memory chip might be shared. #### Multithreading -Multithreading is similar to multiprocessing, except that, during execution, the threads all share the same memory space. +Multithreading is similar to multiprocessing, except that, during execution, the +threads all *share the same memory space*. Native Python struggles to implement multithreading due to some [legacy design features](https://wiki.python.org/moin/GlobalInterpreterLock). But this is not a restriction for scientific libraries like NumPy and Numba. Functions imported from these libraries and JIT-compiled code run in low level execution environments where Python's legacy restrictions don't apply. 
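A minimal sketch of the shared-memory point, using only the standard library's `threading` module (this is purely illustrative -- it is not how NumPy or Numba manage threads):

```python
import threading

# All threads live inside one process and share its memory space:
# every worker appends to the very same list object.
results = []

def worker(tag):
    results.append(tag)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

With multiprocessing, by contrast, each worker would operate on its own copy of `results`, and the results would have to be communicated back explicitly.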
+ #### Advantages and Disadvantages Multithreading is more lightweight because most system and memory resources diff --git a/lectures/numba.md b/lectures/numba.md index 6668e6c4..a39010da 100644 --- a/lectures/numba.md +++ b/lectures/numba.md @@ -48,11 +48,13 @@ import matplotlib.pyplot as plt In an {doc}`earlier lecture ` we discussed vectorization, which can improve execution speed by sending array processing operations in batch to efficient low-level code. -However, as {ref}`discussed previously `, traditional vectorization schemes, such as those found in MATLAB, Julia, and NumPy, have several weaknesses. +However, as {ref}`discussed in that lecture `, +traditional vectorization schemes, such as those found in MATLAB, Julia, and NumPy, have weaknesses. -For example, they can be highly memory-intensive and, for some algorithms, vectorization is ineffective or impossible. +* Highly memory-intensive for compound array operations +* Ineffective or impossible for some algorithms -One way around these problems is through [Numba](https://numba.pydata.org/), a +One way to circumvent these problems is by using [Numba](https://numba.pydata.org/), a **just in time (JIT) compiler** for Python that is oriented towards numerical work. Numba compiles functions to native machine code instructions during runtime. @@ -62,6 +64,14 @@ When it succeeds, Numba will be on par with machine code from low-level language In addition, Numba can do other useful tricks, such as {ref}`multithreading` or interfacing with GPUs (through `numba.cuda`). +Numba's JIT compiler is in many ways similar to the JIT compiler in Julia. + +The main difference is that it is less ambitious, attempting to compile a smaller subset of the language. + +Although this might sound like a deficiency, it is in some ways an advantage. + +Numba is lean, easy to use, and very good at what it does. + This lecture introduces the core ideas. (numba_link)= @@ -74,10 +84,10 @@ 
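The first weakness -- memory use in compound array operations -- can be seen in a small NumPy sketch: each intermediate term in the expression allocates a full-size temporary array (the in-place variant is shown only for contrast, not as a general recommendation):

```python
import numpy as np

n = 100_000
x = np.linspace(0, 1, n)

# The compound expression below allocates full-size temporaries:
# one array for x**2, another for x**2 + 1, then the final result.
y = np.cos(x**2 + 1)

# Working in place avoids some of those temporaries, at a cost in clarity.
z = x**2           # one temporary that we keep
z += 1             # updated in place, no fresh allocation
np.cos(z, out=z)   # result written into the same buffer

print(np.allclose(y, z))
```

The two approaches compute the same values; the difference is purely in how many full-size buffers are allocated along the way.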
(quad_map_eg)= ### An Example -Let's consider a problem that's difficult to vectorize: generating the -trajectory of a difference equation given an initial condition. +Let's consider a problem that's difficult to vectorize (i.e., hand off to array +processing operations). -We will take the difference equation to be the quadratic map +The problem involves generating a trajectory of the quadratic map $$ x_{t+1} = \alpha x_t (1 - x_t) $$ @@ -168,22 +178,23 @@ The basic idea is this: * Python is very flexible and hence we could call the function qm with many types. * e.g., `x0` could be a NumPy array or a list, `n` could be an integer or a float, etc. -* This makes it hard to *pre*-compile the function (i.e., compile before runtime). -* However, when we do actually call the function, say by running `qm(0.5, 10)`, - the types of `x0` and `n` become clear. +* This makes it very difficult to generate efficient machine code *ahead of time* (i.e., before runtime). +* However, when we do actually *call* the function, say by running `qm(0.5, 10)`, + the types of `x0` and `n` become clear. * Moreover, the types of *other variables* in `qm` *can be inferred once the input types are known*. -* So the strategy of Numba and other JIT compilers is to wait until this - moment, and then compile the function. +* So the strategy of Numba and other JIT compilers is to *wait until the function is called*, and then compile. -That's why it is called "just-in-time" compilation. +This is called "just-in-time" compilation. -Note that, if you make the call `qm(0.5, 10)` and then follow it with `qm(0.9, 20)`, compilation only takes place on the first call. +Note that, if you make the call `qm(0.5, 10)` and then follow it with `qm(0.9, +20)`, compilation only takes place on the first call. This is because compiled code is cached and reused as required. This is why, in the code above, `time3` is smaller than `time2`. 
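The wait-then-compile strategy can be caricatured in a few lines of pure Python. The wrapper below is a toy -- it mimics only the per-signature caching, not the machine-code generation that Numba actually performs:

```python
# Toy sketch of JIT dispatch: specialize on the first call with a given
# argument-type signature, then reuse the cached version thereafter.
cache = {}

def toy_jit(func):
    def wrapper(*args):
        signature = tuple(type(a) for a in args)
        if signature not in cache:
            # A real JIT compiler would generate machine code here.
            cache[signature] = func
        return cache[signature](*args)
    return wrapper

@toy_jit
def qm_step(x):
    return 4.0 * x * (1 - x)   # one step of the quadratic map, with α fixed at 4.0

qm_step(0.5)        # first float call triggers the "compilation" branch
qm_step(0.9)        # same type signature, so the cached version is reused
print(len(cache))   # one cached specialization
```

Calling `qm_step` with a new argument type would add a second entry to the cache, mirroring the way a JIT compiler keeps one compiled version per type signature.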
+ ## Decorator Notation In the code above we created a JIT compiled version of `qm` via the call @@ -226,30 +237,22 @@ with qe.Timer(precision=4): Numba also provides several arguments for decorators to accelerate computation and cache functions -- see [here](https://numba.readthedocs.io/en/stable/user/performance-tips.html). + ## Type Inference Successful type inference is a key part of JIT compilation. -As you can imagine, inferring types is easier for simple Python objects (e.g., simple scalar data types such as floats and integers). +As you can imagine, inferring types is easier for simple Python objects (e.g., +simple scalar data types such as floats and integers). Numba also plays well with NumPy arrays, which have well-defined types. In an ideal setting, Numba can infer all necessary type information. -This allows it to generate native machine code, without having to call the Python runtime environment. +This allows it to generate efficient native machine code, without having to call the Python runtime environment. When Numba cannot infer all type information, it will raise an error. -```{note} -In older versions of Numba, the `@jit` decorator would silently fall back -to "object mode" when it could not infer all types, which provided little or -no speed gain. Current versions of Numba use `nopython` mode by default, -meaning the compiler insists on full type inference and raises an error if -it fails. You will often see `@njit` used in other code, which is simply -an alias for `@jit(nopython=True)`. Since nopython mode is now the default, -`@jit` and `@njit` are equivalent. -``` - For example, in the (artificial) setting below, Numba is unable to determine the type of function `mean` when compiling the function `bootstrap` ```{code-cell} ipython3 @@ -297,9 +300,7 @@ Let's add some cautionary notes. As we've seen, Numba needs to infer type information on all variables to generate fast machine-level instructions. 
-For simple routines, Numba infers types very well. - -For larger ones, or for routines using external libraries, it can easily fail. +For large routines or those using external libraries, this process can easily fail. Hence, it's best to focus on speeding up small, time-critical snippets of code. @@ -333,32 +334,14 @@ function. When Numba compiles machine code for functions, it treats global variables as constants to ensure type stability. -### Caching Compiled Code - -By default, Numba recompiles functions each time a new Python session starts. - -To avoid this overhead, you can pass `cache=True` to the decorator: - -```{code-cell} ipython3 -@jit(cache=True) -def qm(x0, n): - x = np.empty(n+1) - x[0] = x0 - for t in range(n): - x[t+1] = α * x[t] * (1 - x[t]) - return x -``` - -This stores the compiled code on disk so that subsequent sessions can skip -the compilation step. (multithreading)= ## Multithreaded Loops in Numba -In addition to JIT compilation, Numba provides support for parallel computing on CPUs. +In addition to JIT compilation, Numba provides support for parallel computing on CPUs and GPUs. -The key tool for parallelization in Numba is the `prange` function, which tells -Numba to execute loop iterations in parallel across available CPU cores. +The key tool for parallelization on CPUs in Numba is the `prange` function, which tells +Numba to execute loop iterations in parallel across available cores. To illustrate, let's look first at a simple, single-threaded (i.e., non-parallelized) piece of code. @@ -418,27 +401,10 @@ Now let's suppose that we have a large population of households and we want to know what median wealth will be. This is not easy to solve with pencil and paper, so we will use simulation -instead. - -In particular, we will simulate a large number of households and then -calculate median wealth for this group. - -Suppose we are interested in the long-run average of this median over time. 
- -For the specification that we've chosen above, we can -calculate this by taking a one-period cross-sectional snapshot of median -wealth of the group at the end of a long simulation. - -Moreover, provided the simulation period is long enough, initial conditions don't matter. - -(This is due to [ergodicity](https://python.quantecon.org/finite_markov.html#id15).) - -So, in summary, we are going to simulate 50,000 households by - -1. arbitrarily setting initial wealth to 1 and -1. simulating forward in time for 1,000 periods. +instead: -Then we'll calculate median wealth at the end period. +1. Simulate a large number of households forward in time +2. Calculate median wealth Here's the code: @@ -492,6 +458,8 @@ with qe.Timer(): The speed-up is significant. +Notice that we parallelize across households rather than over time -- updates of +an individual household across time periods are inherently sequential. ## Exercises @@ -550,8 +518,9 @@ So we get a speed gain of 2 orders of magnitude by adding four characters. :label: speed_ex2 ``` -In the [Introduction to Quantitative Economics with Python](https://intro.quantecon.org/intro.html) lecture series you can -learn all about finite-state Markov chains. +In the [Introduction to Quantitative Economics with +Python](https://intro.quantecon.org/intro.html) lecture series you can learn all +about finite-state Markov chains. For now, let's just concentrate on simulating a very simple example of such a chain.
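As a warm-up, here is one way such a simulation might be written. The two-state transition matrix `P` below is made up for illustration; the exercise will specify its own chain:

```python
import numpy as np

# Hypothetical transition matrix: P[i, j] = probability of moving i -> j.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def simulate_chain(P, x0, ts_length, seed=1234):
    rng = np.random.default_rng(seed)
    x = np.empty(ts_length, dtype=np.int64)
    x[0] = x0
    for t in range(ts_length - 1):
        # Draw the next state using the row of P for the current state
        x[t+1] = rng.choice(2, p=P[x[t]])
    return x

x = simulate_chain(P, x0=0, ts_length=50_000)
print(x.mean())   # long-run fraction of time spent in state 1
```

For this particular `P`, the stationary probability of state 1 is $0.1 / (0.1 + 0.2) = 1/3$, so the printed fraction should be close to that value.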