---
title: "Part 1: towards high-performance Python"
aliases:
- /pythonhpc
---
**Start on November 20<sup>th</sup>, continue on November 27<sup>th</sup>, 10:00am–12:00pm Pacific Time**
**Abstract**: Python has become the dominant language in scientific computing thanks to its high-level syntax,
extensive ecosystem, and ease of use. However, its performance often lags behind traditional compiled
languages like C, C++, and Fortran, as well as newer contenders like Julia and Chapel. This course is designed
to help you speed up your Python workflows using a variety of tools and techniques.
We'll begin with classic optimization methods such as NumPy-based vectorization, and explore just-in-time
compilation using Numba, along with performance profiling techniques. From there, we'll delve into
parallelization -- starting with multithreading using external libraries like NumExpr and Python 3.13's new
free-threading capabilities -- but placing greater emphasis on multiprocessing.
Next, we'll dive into Ray, a powerful and flexible framework for scaling Python applications. While Ray is
widely used in AI, our focus will be on its core capabilities for distributed computing and data
processing. You'll learn how to parallelize CPU-bound numerical workflows -- with and without reduction -- as
well as optimize I/O-bound tasks. We'll also explore combining Ray with Numba and will discuss coding tightly
coupled parallel problems.
Please note: this course does not cover GPU computing (which merits its own course), nor will we dive into
`mpi4py`, the standard MPI library for Python.
::: {.callout-note}
In these notes all timings are for a 2021 MacBook Pro equipped with the M1 Pro chip and 16 GB of memory. On
other machines, including our current training cluster, the timings will be different. They can also vary
quite a bit from one training cluster to another, depending on the virtual machine (VM) flavours and the
underlying physical cores allocated to run these VMs. What's important is the relative timing of one code
vs. another when run on the same machine under the same conditions.
:::
## Installation
Today we'll be running Python on the training cluster. If instead you want to run everything locally on your
computer, you can install Python and the libraries, but the installation instructions will vary depending on
how you typically install Python, and whether/how you use Python virtual environments. Here is what I did on
my computer (with `uv` installed earlier):
```sh
uv venv ~/env-hpc --python 3.12 # create a new virtual environment
source ~/env-hpc/bin/activate
uv pip install numpy
uv pip install --upgrade "ray[default]"
uv pip install --upgrade "ray[data]"
uv pip install tqdm netcdf4 scipy numexpr psutil multiprocess numba scalene Pillow
uv pip list # show installed packages
...
deactivate
```
<!-- # pip uninstall pandas -->
<!-- # pip install -Iv pandas==2.1.4 -->
On a production HPC cluster:
::: {.callout-note}
Don't do this on the training cluster: I already installed all these packages centrally for all users.
:::
```sh
module load python/3.12.4 arrow/19.0.1 scipy-stack/2025a netcdf/4.9.2
virtualenv --no-download env-hpc
source env-hpc/bin/activate
pip install --no-index --upgrade pip
pip install --no-index numba multiprocess numexpr
avail_wheels "ray"
pip install --no-index ray tqdm scalene grpcio
pip install modin
...
deactivate
```
<!-- In this course we will be doing all work on our training cluster on which we have already installed Python and -->
<!-- all the necessary libraries. -->
## Introduction
Python pros | Python cons
--------------------------------------------|------------------------
elegant scripting language | slow (interpreted, dynamically typed)
easy/fast to develop and read code | uses indentation for code blocks
powerful, compact constructs for many tasks |
huge number of external libraries |
very popular across all fields |
::: {.callout-note}
##### Python is interpreted:
- translation to machine code happens line-by-line
- no cross-line optimization
:::
::: {.callout-note}
##### Python is dynamically typed:
- since Python's data collections can be very diverse, types are part of the data ⇒ significant overhead
- automatic memory management: on-the-fly reference counting, garbage collection, range checking ⇒ slow
:::
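To get a feel for this per-object overhead, compare what a variable carries around in Python (a quick aside; the exact byte counts are CPython implementation details and vary between versions and platforms):

```py
import sys

x = 12345
print(type(x), sys.getsizeof(x))   # far more than 8 bytes: type pointer, reference count, size
x = "apple"
print(type(x), sys.getsizeof(x))   # the same name now points to a completely different object
```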
All these high-level features make Python slow. In this course we will concentrate on getting the most
performance out of it using various technologies.

## Python setup in our course
Today we'll be running Python inside a shell on our training cluster `python.vastcloud.org`. Let's log in now!
We have pre-installed all the required libraries for you in a virtual Python environment in
`/project/def-sponsor00/shared/env-hpc` that everyone on the system can read.
Once on the system, our workflow is going to be:
```sh
mkdir -p ~/tmp && cd ~/tmp
module load python/3.12.4 arrow/19.0.1 scipy-stack/2025a netcdf/4.9.2
source /project/def-sponsor00/shared/env-hpc/bin/activate
# our default mode
salloc --time=2:00:0 --mem-per-cpu=3600
...
exit
# more memory needed for several serial examples
salloc --time=2:00:0 --mem-per-cpu=12000
...
exit
# sometimes we'll use 4 cores
salloc --cpus-per-task=4 --time=2:00:0 --mem-per-cpu=3600
...
exit
# very similar
salloc --ntasks=4 --nodes=1 --time=2:00:0 --mem-per-cpu=3600
...
exit
# might run across multiple nodes
salloc --ntasks=4 --time=2:00:0 --mem-per-cpu=3600
...
exit
deactivate
```
Please do not run Python on the login node, as it'll make the login node slow for everyone and might
potentially crash it.
Finally, to monitor CPU usage inside a Slurm job, run the following from the login node (this will connect to
the running job and launch `htop` inside it):
```sh
srun --jobid=<jobID> --pty htop -u $USER -s PERCENT_CPU --filter "python"
```
Inside `htop` you can repeatedly press "Shift+H" to show individual threads or group them inside their parent
processes.
## Slow series
When I teach parallel computing in other languages (Julia, Chapel), the approach is to take a numerical
problem and parallelize it using multiple processors, and concentrate on various issues and bottlenecks
(variable locks, load balancing, false sharing, messaging overheads, etc.) that lead to less than 100%
**parallel efficiency**. For the numerical problem I usually select something that is very simple to code, yet
forces the machine to do brute-force calculation that cannot be easily optimized.
One such problem is a *slow series*. It is a well-known fact that the harmonic series
$\sum\limits_{k=1}^\infty{1\over k}$ diverges. It turns out that if we omit the terms whose denominators in
decimal notation contain any *digit* or *string of digits*, it converges, albeit very slowly (Schmelzer &
Baillie 2008), e.g.
$$\sum_{\substack{k=1\\ k\ \text{contains no digit}\ 9}}^{\infty} \frac{1}{k} = 22.9206766192\ldots$$
But this slow convergence is actually good for us: our answer will be bounded by the exact result
(22.9206766192...) on the upper side, and we will force the computer to do CPU-intensive work. We will sum all
the terms whose denominators do not contain the digit "9", and evaluate the first $10^8$ terms.
I implemented and timed this problem running in serial in Julia (356ms) and Chapel (300ms) -- both are
compiled languages. Here is one possible Python implementation:
```py
from time import time

def slow(n: int):
    total = 0
    for i in range(1,n+1):
        if not "9" in str(i):
            total += 1.0 / i
    return total

start = time()
total = slow(100_000_000)
end = time()
print("Time in seconds:", round(end-start,3))
print(total)
```
Let's save this code inside the file `slowSeries.py` and run it. Depending on the power supplied to my
laptop's CPU (which I find varies quite a bit depending on the environment), I get an average runtime of
6.625 seconds. That's ~20X slower than in Julia and Chapel!
Note that for other problems you will likely see an even bigger (100-200X) gap between Python and the compiled
languages. In other languages, searching for a substring in the decimal representation of a number takes a
while, so there you would want to code this check using integer operations. If we also do this via integer
operations in Python:
```py
def digitsin(num: int):
    base = 10
    while 9//base > 0: base *= 10
    while num > 0:
        if num%base == 9: return True
        num = num//10
    return False

def slow(n: int):
    total = 0
    for i in range(1,n+1):
        if not digitsin(i):
            total += 1.0 / i
    return total
```
our code will be ~3-4X slower, so we will use the first version of the code with `if not "9" in str(i)` -- it
turns out that in this particular case Python's high-level substring search is actually quite well optimized!
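If you want to verify this claim on your own machine, here is a minimal `timeit` sketch (my own addition, not part of the benchmarked codes above, with a simplified `digitsin` that checks only for the single digit 9 and a reduced `n` so it finishes quickly):

```py
import timeit

def digitsin(num: int) -> bool:
    # integer-only check for the digit 9, no string conversion
    while num > 0:
        if num % 10 == 9:
            return True
        num //= 10
    return False

n = 100_000   # reduced 1000X from the original problem
t_str = timeit.timeit(lambda: sum(1.0/i for i in range(1, n+1) if "9" not in str(i)), number=1)
t_int = timeit.timeit(lambda: sum(1.0/i for i in range(1, n+1) if not digitsin(i)), number=1)
print(f"string search: {t_str:.4f}s   integer version: {t_int:.4f}s")
```

Absolute numbers and the exact ratio will differ on your machine, but both versions compute the same sum.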
<!-- serial.jl - 444.862 ms with digitSequence variable, 355.692 ms with fixed 9 -->
<!-- serial.chpl - 300 ms -->
<!-- direct calculation in Python: 8.741755723953247 s -->
<!-- vectorized functions in Python: 17.629069089889526 s -->
Looking at other problems, you will see that Python's performance is worse on "tightly coupled" calculations
on fundamental data types (integers, doubles), e.g. when you try to run the same arithmetic calculation on
many elements of an array which is often the case in numerical simulation.
On the other hand, Python performs much better (or "less worse" compared to other languages) when doing file
I/O and text processing. In addition, if your Python code spends most of its time in a compiled numerical
library (often written in C or C++, and more recently in languages like Rust), your overall performance might
not be that bad.
We will come back to the slow-series problem, but let's first take a look at speeding up Python with NumPy.
## NumPy vectorization
Python is dynamically typed, i.e. variables can change their type on the fly:
```py
a = 5
a = 'apple'
print(a)
```
This makes Python very flexible. Out of these variables you can form 1D lists, and these can be inhomogeneous
and can change values and types on the fly:
```py
a = [1, 2, 'Vancouver', ['Earth', 'Moon'], {'list': 'an ordered collection of values'}]
a[1] = 'Sun'
a
```
Python lists are very general and flexible, which is great for high-level programming, but it comes at a cost. The
Python interpreter can't make any assumptions about what will come next in a list, so it treats everything as a generic
object with its own type and size. As lists get longer, eventually performance takes a hit.
Python does not have any mechanism for a uniform/homogeneous list, where -- to jump to element #1000 -- you
just take the memory address of the very first element and then increment it by (element size in bytes)
x 999. The **NumPy** library fills this gap by adding homogeneous collections to Python --
`numpy.ndarray`s -- which are multidimensional, homogeneous arrays of fixed-size items (most commonly
numbers).
1. This brings large performance benefits!
- no reading of extra bits (type, size, reference count)
- no type checking
- contiguous allocation in memory
- NumPy was written in C ⇒ pre-compiled
2. NumPy lets you work with mathematical arrays.
Lists and NumPy arrays behave very differently:
```py
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
a + b # this will concatenate two lists: [1,2,3,4,5,6,7,8]
```
```py
import numpy as np
na = np.array([1, 2, 3, 4])
nb = np.array([5, 6, 7, 8])
na + nb # this will sum two vectors element-wise: array([6,8,10,12])
na * nb # element-wise product
```
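To see the cost difference in action, here is a small timing comparison (the operation, names and sizes are my own choice; absolute times will vary by machine):

```py
import timeit
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n)

# sum of squares: pure-Python loop over a list vs. one vectorized NumPy call
t_list = timeit.timeit(lambda: sum(x*x for x in lst), number=10)
t_numpy = timeit.timeit(lambda: np.sum(arr*arr), number=10)
print(f"list: {t_list:.3f}s   numpy: {t_numpy:.3f}s")
```

Typically the NumPy version wins by one to two orders of magnitude.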
### Vectorized functions on array elements
One of the big reasons for using NumPy is so you can do fast numerical operations on a large number of
elements. The result is another `ndarray`. In many calculations you can replace the usual `for`/`while`
loops with functions on NumPy elements.
```py
a = np.arange(100)
a**2 # each element is a square of the corresponding element of a
np.log10(a+1) # apply this operation to each element
(a**2+a)/(a+1) # the result should effectively be a floating-version copy of a
np.arange(10) / np.arange(1,11) # this is np.array([ 0/1, 1/2, 2/3, 3/4, ..., 9/10 ])
```
### Array broadcasting
An extremely useful feature of vectorized functions is their ability to operate between arrays of different
sizes and shapes, a set of operations known as *broadcasting*.
```py
a = np.array([0, 1, 2]) # 1D array
b = np.ones((3,3)) # 2D array
a + b # `a` is stretched/broadcast across the 2nd dimension before addition;
# effectively we add `a` to each row of `b`
```
In the following example both arrays are broadcast from 1D to 2D to match the shape of the other:
```py
a = np.arange(3) # 1D row; a.shape is (3,)
b = np.arange(3).reshape((3,1)) # effectively 1D column; b.shape is (3, 1)
a + b # the result is a 2D array!
```
NumPy's broadcast rules are:
1. the shape of an array with fewer dimensions is padded with 1's on the left
1. in any dimension where one array has size 1, that array is stretched to match the other array's size
1. if in any dimension the sizes disagree and neither is equal to 1, an error is raised
```
First example above:
********************
a: (3,) -> (1,3) -> (3,3)
b: (3,3) -> (3,3) -> (3,3)
-> (3,3)
Second example above:
*********************
a: (3,) -> (1,3) -> (3,3)
b: (3,1) -> (3,1) -> (3,3)
-> (3,3)
Example 3:
**********
a: (2,3) -> (2,3) -> (2,3)
b: (3,) -> (1,3) -> (2,3)
-> (2,3)
Example 4:
**********
a: (3,2) -> (3,2) -> (3,2)
b: (3,) -> (1,3) -> (3,3)
-> error
"ValueError: operands could not be broadcast together with shapes (3,2) (3,)"
```
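You can also ask NumPy to apply these rules directly, without allocating any arrays: `np.broadcast_shapes` (available since NumPy 1.20) reports the resulting shape, or raises the same `ValueError`. The four examples above:

```py
import numpy as np

print(np.broadcast_shapes((3,), (3, 3)))    # (3, 3)  -- first example
print(np.broadcast_shapes((3,), (3, 1)))    # (3, 3)  -- second example
print(np.broadcast_shapes((2, 3), (3,)))    # (2, 3)  -- example 3
try:
    np.broadcast_shapes((3, 2), (3,))       # example 4: sizes 2 and 3 disagree
except ValueError as e:
    print("error:", e)
```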
Let's see how these rules work on a real-life problem!
### Converting velocity components
Consider a spherical dataset describing Earth's mantle convection, defined on a spherical grid $n_{lat}\times
n_{lon}\times n_r = 500\times 800\times 300$, for a total of $120\times10^6$ grid points. For each grid
point, we need to convert from the spherical (latitudinal -- longitudinal -- radial) velocity components to
their Cartesian equivalents.
<!-- Doing this by hand with Python's `for` loops would take many hours for 13e6 points. I used NumPy to vectorize -->
<!-- in one dimension, and that cut the time to ~5 mins. At first glance, a more complex vectorization would not -->
<!-- work, as NumPy would have to figure out which dimension goes where. Writing it carefully and following the -->
<!-- broadcast rules I made it work, with the correct solution at the end -- while the total compute time went down -->
<!-- to a couple seconds! -->
::: {.callout-note}
##### If short on memory
To run all code fragments in this example at full $500\times 800\times 300$ resolution, you will need to
increase your memory ask to ~12000M, or otherwise your Python session will be killed mid-way. Alternatively,
you can reduce the problem size by 4X (yielding much less impressive runtime difference) and continue
running with 3600M memory:
```py
nlat, nlon = nlat//2, nlon//2 # 30e6 grid points
```
:::
Let's initialize our problem:
```py
import numpy as np
from scipy.special import lpmv
import time
nlat, nlon, nr = 500, 800, 300 # 120e6 grid points
nlat, nlon = nlat//2, nlon//2 # 30e6 grid points
latitude = np.linspace(-90, 90, nlat)
longitude = np.linspace(0, 360, nlon)
radius = np.linspace(3485, 6371, nr)
# spherical velocity components: use Legendre Polynomials to set values
vlat = lpmv(0,3,latitude/90).reshape(nlat,1,1) + np.linspace(0,0,nr).reshape(nr,1) + np.linspace(0,0,nlon)
vlon = np.linspace(0,0,nlat).reshape(nlat,1,1) + np.linspace(0,0,nr).reshape(nr,1) + lpmv(0,2,longitude/180-1.)
vrad = np.linspace(0,0,nlat).reshape(nlat,1,1) + lpmv(0,3,(radius-4928)/1443).reshape(nr,1) + np.linspace(0,0,nlon)
# Cartesian velocity components
vx = np.zeros((nlat,nr,nlon))
vy = np.zeros((nlat,nr,nlon))
vz = np.zeros((nlat,nr,nlon))
```
We need to go through all grid points, and at each point perform a matrix-vector multiplication. Here is our
first attempt:
::: {.callout-note}
You might want to use Python's `tqdm` library to provide a progress bar for real-time runtime estimate.
:::
```py
from tqdm import tqdm

start = time.time()
for i in tqdm(range(nlat)):
    for k in range(nlon):
        for j in range(nr):
            vx[i,j,k] = - np.sin(np.radians(longitude[k]))*vlon[i,j,k] - \
                np.sin(np.radians(latitude[i]))*np.cos(np.radians(longitude[k]))*vlat[i,j,k] + \
                np.cos(np.radians(latitude[i]))*np.cos(np.radians(longitude[k]))*vrad[i,j,k]
            vy[i,j,k] = np.cos(np.radians(longitude[k]))*vlon[i,j,k] - \
                np.sin(np.radians(latitude[i]))*np.sin(np.radians(longitude[k]))*vlat[i,j,k] + \
                np.cos(np.radians(latitude[i]))*np.sin(np.radians(longitude[k]))*vrad[i,j,k]
            vz[i,j,k] = np.cos(np.radians(latitude[i]))*vlat[i,j,k] + \
                np.sin(np.radians(latitude[i]))*vrad[i,j,k]
end = time.time()
print("Time in seconds:", round(end-start,3))
```
On my laptop this calculation took 1280s. There are quite a lot of redundancies (repeated calculations):
e.g. we convert the same angles to radians multiple times, compute $\sin$ and $\cos$ of the same latitude and
longitude multiple times, and so on. Getting rid of these redundancies:
```py
from tqdm import tqdm

start = time.time()
for i in tqdm(range(nlat)):
    lat = np.radians(latitude[i])
    sinlat = np.sin(lat)
    coslat = np.cos(lat)
    for k in range(nlon):
        lon = np.radians(longitude[k])
        sinlon = np.sin(lon)
        coslon = np.cos(lon)
        for j in range(nr):
            vx[i,j,k] = - sinlon*vlon[i,j,k] - sinlat*coslon*vlat[i,j,k] + coslat*coslon*vrad[i,j,k]
            vy[i,j,k] = coslon*vlon[i,j,k] - sinlat*sinlon*vlat[i,j,k] + coslat*sinlon*vrad[i,j,k]
            vz[i,j,k] = coslat*vlat[i,j,k] + sinlat*vrad[i,j,k]
end = time.time()
print("Time in seconds:", round(end-start,3))
```
brings the runtime down to 192s.
::: {.callout-caution collapse="true"}
## Question 1
Does using NumPy's matrix-vector multiplication function `np.dot` speed up this calculation? In this workflow, *at
each latitude-longitude* you would define a rotation matrix and *at each point* a spherical velocity vector to
compute their dot product, i.e. you would replace this fragment:
```py
for j in range(nr):
    vx[i,j,k] = - sinlon*vlon[i,j,k] - sinlat*coslon*vlat[i,j,k] + coslat*coslon*vrad[i,j,k]
    vy[i,j,k] = coslon*vlon[i,j,k] - sinlat*sinlon*vlat[i,j,k] + coslat*sinlon*vrad[i,j,k]
    vz[i,j,k] = coslat*vlat[i,j,k] + sinlat*vrad[i,j,k]
```
with this one:
```py
rot = np.array([[-sinlon, -sinlat*coslon, coslat*coslon],
                [coslon, -sinlat*sinlon, coslat*sinlon],
                [0, coslat, sinlat]])
for j in range(nr):
    vspherical = np.array([vlon[i,j,k], vlat[i,j,k], vrad[i,j,k]])
    vx[i,j,k], vy[i,j,k], vz[i,j,k] = np.dot(rot, vspherical)
```
:::
<!-- Answer: nope, the computation time goes up to 293s. -->
To speed up our computation, we should vectorize over one of the dimensions, e.g. **longitudes**:
```py
from tqdm import tqdm

start = time.time()
lon = np.radians(longitude[0:nlon])
sinlon = np.sin(lon)
coslon = np.cos(lon)
for i in tqdm(range(nlat)):
    lat = np.radians(latitude[i])
    sinlat = np.sin(lat)
    coslat = np.cos(lat)
    for j in range(nr):
        vx[i,j,0:nlon] = - sinlon*vlon[i,j,0:nlon] - sinlat*coslon*vlat[i,j,0:nlon] + coslat*coslon*vrad[i,j,0:nlon]
        vy[i,j,0:nlon] = coslon*vlon[i,j,0:nlon] - sinlat*sinlon*vlat[i,j,0:nlon] + coslat*sinlon*vrad[i,j,0:nlon]
        vz[i,j,0:nlon] = coslat*vlat[i,j,0:nlon] + sinlat*vrad[i,j,0:nlon]
end = time.time()
print("Time in seconds:", round(end-start,3))
```
Let's see how broadcasting works here. Consider the dimensions of all variables in the expression for
computing `vx[i,j,0:nlon]`:
```txt
[1,1,nlon] = - [nlon]*[1,1,nlon] - [1]*[nlon]*[1,1,nlon] + [1]*[nlon]*[1,1,nlon]
```
Omitting scalars ([1]), padding with 1's on the left will produce:
```txt
[1,1,nlon] = - [1,1,nlon]*[1,1,nlon] - [1,1,nlon]*[1,1,nlon] + [1,1,nlon]*[1,1,nlon]
```
Now all variables have the same dimensions, and all operations will be element-wise. The calculation time is
now 3.503s, which is a huge improvement!
Vectorizing over two dimensions, e.g. over **radii** and **longitudes** leaves us with a single loop:
```py
from tqdm import tqdm

start = time.time()
lon = np.radians(longitude[0:nlon])
sinlon = np.sin(lon)
coslon = np.cos(lon)
for i in tqdm(range(nlat)):
    lat = np.radians(latitude[i])
    sinlat = np.sin(lat)
    coslat = np.cos(lat)
    vx[i,0:nr,0:nlon] = - sinlon*vlon[i,0:nr,0:nlon] - sinlat*coslon*vlat[i,0:nr,0:nlon] + coslat*coslon*vrad[i,0:nr,0:nlon]
    vy[i,0:nr,0:nlon] = coslon*vlon[i,0:nr,0:nlon] - sinlat*sinlon*vlat[i,0:nr,0:nlon] + coslat*sinlon*vrad[i,0:nr,0:nlon]
    vz[i,0:nr,0:nlon] = coslat*vlat[i,0:nr,0:nlon] + sinlat*vrad[i,0:nr,0:nlon]
end = time.time()
print("Time in seconds:", round(end-start,3))
```
Let's check broadcasting, taking the expression for computing `vx[i,0:nr,0:nlon]` and omitting scalars ([1]):
```txt
[1,nr,nlon] = - [nlon]*[1,nr,nlon] - [1]*[nlon]*[1,nr,nlon] + [1]*[nlon]*[1,nr,nlon] # original dimensions
[1,nr,nlon] = - [1,1,nlon]*[1,nr,nlon] - [1,1,nlon]*[1,nr,nlon] + [1,1,nlon]*[1,nr,nlon] # after padding with 1's on the left
[1,nr,nlon] = - [1,nr,nlon]*[1,nr,nlon] - [1,nr,nlon]*[1,nr,nlon] + [1,nr,nlon]*[1,nr,nlon] # after stretching 1's
```
Now all variables have the same dimensions, and all operations will be element-wise. The calculation time goes
down to 1.487s.
::: {.callout-note}
##### More vectorization != always faster
You can also vectorize in all three dimensions (latitudes, radii and longitudes), resulting in no explicit
Python loops in your calculation at all, but this requires a little bit of extra work, and the computation time
will actually slightly go up -- **any idea why**?
:::
In the end, with NumPy's element-wise vectorized operations, we improved our time from 1280s to 1.5s! The
entire calculation was done as a batch in a compiled C code, instead of cycling through individual elements
and then calling a compiled C code on each. There are a lot fewer Python code lines to interpret and run.
### Back to the slow series
Let's use NumPy for our slow series calculation. We will:
1. write a function that acts on each integer $k$ in the series,
2. convert this function to a *vectorized function* that takes in **an array of integer numbers** and returns
**an array of summation terms**,
3. sum these terms to get the result.
Let's save the following code as `slow2.py`:
```py
from time import time
import numpy as np

n = 100_000_000

def combined(k):
    if "9" not in str(k):
        return 1.0/k
    else:
        return 0.0

v3 = np.vectorize(combined)

start = time()
i = np.arange(1,n+1)
total = np.sum(v3(i))
end = time()
print("Time in seconds:", round(end-start,3))
print(total)
```
::: {.callout-note}
This particular code uses a lot of memory, but you have three options:
1. restart the job with `--mem-per-cpu=14400` request (and single core) -- probably not a good idea in a
shared environment
1. reduce `n` by 10X and then scale the time estimate by 10X -- easy and will give a rough estimate, but the
summation time does not scale exactly linearly with the number of terms (larger numbers have a higher
probability of having 9's in them)
1. break one large numpy array into pieces, as shown in the code below
:::
```py
from time import time
import numpy as np

n = 100_000_000

def combined(k):
    if "9" not in str(k):
        return 1.0/k
    else:
        return 0.0

size = n//10 # 10 pieces: 1-10e6, 10,000,001-20e6, ..., 90,000,001-100e6
intervals = [(i*size+1,(i+1)*size) for i in range(10)]
if n > intervals[-1][1]:
    intervals[-1] = (intervals[-1][0], n)

v3 = np.vectorize(combined)

start = time()
total = 0
for span in intervals:
    i = np.arange(span[0],span[1]+1)
    total += np.sum(v3(i))
end = time()
print("Time in seconds:", round(end-start,3))
print(total)
```
The vectorized function is supposed to speed up calculating the terms, but our time becomes significantly
worse (14.72s) than the original calculation (6.625s). **The reason**: we are no longer replacing multiple
Python lines with a single line. The code inside `combined()` is still native Python code that is interpreted
on the fly, and we are applying all its lines to each element of the array `i`.
The function `np.vectorize` does not compile `combined()` -- it simply adapts it to work with arrays, but
underneath you are still running Python loops. If, instead, our vectorization could produce a *compiled
function* that takes an array of integers and computes the slow series sum all inside the same
C/C++/Fortran/Rust function, it would run ~20X faster than the original calculation. This is what a JIT
compiler can do -- we will study it later.
As it turns out, to speed up this problem with NumPy, you really need to perform the check for digit "9" via
low-level custom NumPy code with a combination of vectorizable integer and floating operations, and this is
very difficult -- but not impossible -- to implement.
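For the curious, here is one possible sketch of such an implementation (my own addition, not one of the versions benchmarked above): peel off decimal digits with vectorized integer division and build a boolean mask, so the entire digit check runs in compiled NumPy code:

```py
import numpy as np

def no_nine_mask(i: np.ndarray) -> np.ndarray:
    """Return True where the decimal representation of i contains no digit 9."""
    mask = np.ones(i.shape, dtype=bool)
    num = i.copy()
    while num.max() > 0:        # one decimal digit per iteration, all elements at once
        mask &= (num % 10 != 9)
        num //= 10
    return mask

n = 1_000_000                   # reduced 100X for a quick check
i = np.arange(1, n+1)
total = (1.0/i)[no_nine_mask(i)].sum()
print(total)
```

Whether this actually beats the plain string-based loop at the full $10^8$ terms (and within your memory budget) is left as an experiment.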
<!-- Here are a couple other implementations: -->
<!-- In the 2nd NumPy version (average runtime 16.04s) we create a vectorized function returning an array of -->
<!-- True/False for input integers $i$ checking for the presence of "9" in their decimal notation, use this output -->
<!-- array as a mask to $i$, then invert all elements in the resulting array, and finally sum them up: -->
<!-- ```py -->
<!-- from time import time -->
<!-- import numpy as np -->
<!-- n = 100_000_000 -->
<!-- nodigitsin = np.vectorize(lambda x: "9" not in str(x)) -->
<!-- inverse = np.vectorize(lambda x: 1.0/x) -->
<!-- start = time() -->
<!-- i = np.arange(1,n+1) -->
<!-- total = np.sum(inverse(i[nodigitsin(i)])) -->
<!-- end = time() -->
<!-- print("Time in seconds:", round(end-start,3)) -->
<!-- print(total) -->
<!-- ``` -->
<!-- In the 3rd NumPy version (average runtime 23.47s) we create a vectorized function returning an array of -->
<!-- True/False for input integers $i$, and another vectorized function inverting all $i$, and compute a dot -->
<!-- product of their respective outputs (True/False are 1/0), summing up all its elements: -->
<!-- ```py -->
<!-- from time import time -->
<!-- import numpy as np -->
<!-- n = 100_000_000 -->
<!-- nodigitsin = np.vectorize(lambda x: "9" not in str(x)) -->
<!-- inverse = np.vectorize(lambda x: 1.0/x) -->
<!-- start = time() -->
<!-- i = np.arange(1,n+1) -->
<!-- total = np.inner(inverse(i),nodigitsin(i)) -->
<!-- end = time() -->
<!-- print("Time in seconds:", round(end-start,3)) -->
<!-- print(total) -->
<!-- ``` -->
<!-- terms vectorize pretty well, but most of the time is taken by summation => need to parallelize -->
## Parallelization
We can only start parallelizing the code when we are certain that our serial performance is not too bad, or at
the very least we have optimized our serial Python code as much as possible. At this point, *we don't know* if
we have the best optimized Python code, as we are only starting to look into various tools that (hopefully)
could speed up our code. We know that our code is ~20X slower than a compiled code in Julia/Chapel/C/Fortran,
but do we have the best-performing Python code?
Next we'll try to speed up our code with the NumExpr expression evaluator, in which simple mathematical / NumPy
expressions can be parsed and then evaluated using *compiled C code*. NumExpr has the added benefit that you
can do this evaluation with multiple threads in parallel. But first we should talk about threads and
processes.
<!-- 1. You already have a fairly large code that you run in serial, and you want to speed it up on multiple cores -->
<!-- or nodes; can you optimize this code before parallelizing it? profiling tools -->
<!-- 2. If your code is I/O-bound (not CPU/GPU-bound) - perfect case for parallelization, especially when run on a cluster -->
<!-- 3. -->
### Threads vs processes
In Unix a **process** is the smallest independent unit of processing, with its own memory space -- think of an
instance of a running application. The operating system tries its best to isolate processes so that a problem
with one process doesn't corrupt or cause havoc with another process. Context switching between processes is
relatively expensive.
A process can contain multiple **threads**, each running on its own CPU core (parallel execution), or sharing
CPU cores if there are too few CPU cores relative to the number of threads (parallel + concurrent
execution). All threads in a Unix process share the virtual memory address space of that process, e.g. several
threads can update the same variable, whether it is safe to do so or not (we'll talk about thread-safe
programming in this course). Context switching between threads of the same process is less expensive.
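To make the shared-memory point concrete, here is a minimal sketch: several threads update the same global variable, with a `threading.Lock` guarding the update so that it stays thread-safe:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:        # without the lock, two threads could interleave
            counter += 1  # their read-modify-write of the shared counter

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(counter)   # 400000 -- all four threads saw and updated the same variable
```

Removing the lock makes the result non-deterministic on a truly parallel (free-threaded) build: that is exactly the kind of race condition we need to guard against.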

<!-- "Image copied from https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads" -->
- Threads within a process communicate via shared memory, so **multi-threading** is always limited to shared
memory within one node.
- Processes communicate via messages (over the cluster interconnect or via shared
memory). **Multi-processing** can be in shared memory (one node, multiple CPU cores) or distributed memory
(multiple cluster nodes). With multi-processing there is no scaling limitation, but traditionally it has
  been more difficult to write code for distributed-memory systems. Various libraries try to simplify this
  with high-level abstractions.
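Conversely, a quick way to see that processes do *not* share memory: in this sketch the child process increments its own copy of a variable, leaving the parent's copy untouched (it uses the POSIX-only `fork` start method, so the child inherits the parent's data without pickling):

```python
import multiprocessing as mp

counter = 0

def increment():
    global counter
    counter += 1          # modifies the child process's copy only

ctx = mp.get_context("fork")   # "fork" is POSIX-only (Linux / macOS)
p = ctx.Process(target=increment)
p.start()
p.join()
print(counter)   # still 0 in the parent: each process has its own memory space
```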
> ### Discussion
> What are the benefits of each type of parallelism: multi-threading vs. multi-processing?
> Consider:
> 1. context switching, e.g. starting and terminating or concurrent execution on the same CPU core,
> 1. communication,
> 1. scaling up.
## Multithreading
Python uses *reference counting* for memory management. Each object in memory that you create in Python has a
counter storing the number of references to that object. Once this counter reaches zero (when you delete
objects), Python releases memory occupied by the object. If you have multiple threads, letting them modify the
same counter at the same time can lead to a race condition where you can write an incorrect value to the
counter, leading either to memory leaks (too much memory allocated) or to memory being released while
there is still a reference to that object on another thread.
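You can inspect an object's reference counter with `sys.getrefcount`; note that the reported value includes the temporary reference held by the function call itself:

```python
import sys

a = []                   # a new list object with one name bound to it
r1 = sys.getrefcount(a)  # the getrefcount() call adds a temporary reference
b = a                    # bind a second name to the same object
r2 = sys.getrefcount(a)
del b                    # drop the second reference
r3 = sys.getrefcount(a)
print(r1, r2, r3)        # r2 is one higher than r1; r3 is back to r1
```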
One way to solve this problem is to have locks on all shared data structures, where only one thread at a time
can modify data. This can itself lead to problems, so Python's solution prior to version 3.13 was to place a
lock on the interpreter itself (the Global Interpreter Lock, or GIL), so that only **one thread can run at a
time**. Recall that in Python each line of code is interpreted on the fly, so placing a lock on the
interpreter means pausing execution of your code while some other thread is running.
### Installing free-threaded Python
Starting with v3.13, you can install Python with the GIL removed. This mode, called *free threading*, is still
considered experimental and is usually not enabled by default. There is a good [Real Python
article](https://realpython.com/python313-free-threading-jit) (might require a subscription) describing
various build options for enabling free threading, including the experimental JIT. Here is a simple setup on
my laptop:
```sh
uv venv ~/env-free --python 3.14t # create a new virtual environment
source ~/env-free/bin/activate
...
deactivate
```
At the time of this writing, there is no free-threaded Python module on the clusters. Therefore, on the
training cluster I used `uv` to install free-threaded Python 3.14t into a world-readable subdirectory in
`/project/def-sponsor00/shared`:
::: {.callout-caution}
`uv` is [currently not supported on our clusters](https://docs.alliancecan.ca/wiki/Uv){target="_blank"} so use
it with caution.
:::
```sh
curl -LsSf https://astral.sh/uv/install.sh | sh # install uv
# make sure ~/.local/bin is in your path (the location of `uv` binary)
uv venv /project/def-sponsor00/shared/env-free --python 3.14t # create a new virtual environment
source /project/def-sponsor00/shared/env-free/bin/activate
...
which python # /project/def-sponsor00/shared/env-free/bin/python
ls -l $(which python) # symbolic link to ~user01/.local/share/uv/python/...
deactivate
countfiles /project/def-sponsor00/shared/env-free # 15 files
countfiles ~/.local/share/uv/python # 5,508 files
```
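To verify that the interpreter you activated is actually a free-threaded build, you can query its build configuration; the `sys._is_gil_enabled()` check assumes Python 3.13+:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 in a free-threaded build, 0 (or None on older versions) otherwise
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# on 3.13+ you can also check whether the GIL is actually active in this process
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())
```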
### Running free-threaded Python
Python has several multithreading modules in its Standard Library. For example, the `threading` module
provides a `Thread` class for launching new threads. Here, instead, I will use the `concurrent.futures`
module for launching threads and the `threading` module for printing thread names:
```py
import concurrent.futures as ft
import threading as t
t.current_thread().name # 'MainThread'
def worker(x):
    print(t.current_thread().name)

with ft.ThreadPoolExecutor() as pool:
    results = pool.map(worker, range(30)) # in the output 'ThreadPoolExecutor-0_1':
                                          # 0 is the pool number
                                          # 1 is the thread number
```
In this simple code above we'll see output mostly from threads 0, 1, maybe 2, as they are not very busy.
Here is a full example of the slow series calculation (store it as `sum.py`) with native multithreading:
<!-- that instead uses `futures.ThreadPoolExecutor()` function from `concurrent` library: -->
```py
from time import time
import argparse
import concurrent.futures as ft
import threading as t
parser = argparse.ArgumentParser()
parser.add_argument('--n', default=100_000_000, type=int, help='number of terms')
parser.add_argument('--nthreads', default=1, type=int, help='number of tasks')
args = parser.parse_args()
n, nthreads = args.n, args.nthreads
def slow(interval):
    print(t.current_thread().name)
    total = 0
    for i in range(interval[0], interval[1]+1):
        if "9" not in str(i):
            total += 1.0 / i
    return total

size = n // nthreads   # size of each batch
intervals = [(i*size+1, (i+1)*size) for i in range(nthreads)]
if n > intervals[-1][1]: intervals[-1] = (intervals[-1][0], n)   # add the remainder, if any

print("running with", args.nthreads, "threads over", intervals)

start = time()
with ft.ThreadPoolExecutor() as pool:
    results = pool.map(slow, intervals)
end = time()
print("Time in seconds:", round(end-start,3))
print("sum =", sum(r for r in results))
```
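To see what the batching logic above produces, here is the same interval splitting run with small numbers:

```python
# sanity check of the batching logic with n = 10 terms over 3 threads
n, nthreads = 10, 3
size = n // nthreads                              # 3 terms per batch
intervals = [(i*size + 1, (i+1)*size) for i in range(nthreads)]
if n > intervals[-1][1]:
    intervals[-1] = (intervals[-1][0], n)         # fold the remainder into the last batch
print(intervals)   # [(1, 3), (4, 6), (7, 10)]
```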
If you run this code with v3.12 or earlier, you will get about the same runtime independently of the number of
threads, as -- due to the GIL -- these threads will be taking turns running. With a free-threaded Python 3.14
build, you will see better runtimes with additional threads, provided you have the CPU cores to run all
threads in parallel:
```sh
cd ~/tmp
source /project/def-sponsor00/shared/env-free/bin/activate
salloc --cpus-per-task=4 --time=0:60:0 --mem-per-cpu=3600
python sum.py
python sum.py --nthreads=2
python sum.py --nthreads=4
python sum.py --nthreads=8 # do we get further 2X speedup?
```
```output
running with 1 threads over [(1, 100000000)]
Time in seconds: 7.331
sum = 13.277605949858103
```
```output
running with 2 threads over [(1, 50000000), (50000001, 100000000)]
Time in seconds: 3.801
sum = 13.277605949855722
```
```output
running with 4 threads over [(1, 25000000), (25000001, 50000000),
(50000001, 75000000), (75000001, 100000000)]
Time in seconds: 2.977
sum = 13.277605949854326
```
Note that the free-threaded build has some additional overhead when executing Python code compared to the
default GIL-enabled build. In 3.13 this overhead is ~40%; in 3.14 it seems to be somewhat smaller.
### NumExpr with regular Python
Alternatively, you can do multithreading in Python via third-party libraries written in other languages that
have no GIL. One well-known library is NumExpr, which is essentially a compiler for NumPy operations: it
takes a NumPy expression as a string and can evaluate it with multiple threads.
- supported operators: `+ - * / % << >> < <= == != >= > & | ~ **`
- supported functions: `where`, `sin`, `cos`, `tan`, `arcsin`, `arccos`, `arctan`, `arctan2`, `sinh`, `cosh`,
  `tanh`, `arcsinh`, `arccosh`, `arctanh`, `log`, `log10`, `log1p`, `exp`, `expm1`, `sqrt`, `abs`, `conj`,
  `real`, `imag`, `complex`, `contains`
- supported reductions: `sum`, `prod`
- supported data types: 8-bit boolean, 32-bit signed integer (int or int32), 64-bit signed integer (long or
int64), 32-bit single-precision floating point number (float or float32), 64-bit double-precision floating
point number (double or float64), 2x64-bit double-precision complex number (complex or complex128), raw
string of bytes
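A minimal illustration of the string-based API, assuming `numexpr` is installed (`pip install numexpr`):

```python
import numpy as np
import numexpr as ne

a = np.arange(1.0, 6.0)          # [1. 2. 3. 4. 5.]
b = np.arange(5.0)               # [0. 1. 2. 3. 4.]

# an element-wise expression, parsed and compiled by NumExpr
print(ne.evaluate("2*a + b**2")) # [ 2.  5. 10. 17. 26.]

# a supported reduction
print(ne.evaluate("sum(a)"))     # 15.0
```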
Here is a very simple NumPy example, timing only the calculation itself:
```py
from time import time
import numpy as np
n = 100_000_000
a = np.random.rand(n)
start = time()
b = np.sin(2*np.pi*a)**2 + np.cos(2*np.pi*a)**2
end = time()
print("Time in seconds:", round(end-start,3))
print(a,b)
```
I get an average runtime of 2.04 s. Here is how you would implement the same calculation with NumExpr:
```py
from time import time
import numpy as np, numexpr as ne
n = 100_000_000
a = np.random.rand(n)
prev = ne.set_num_threads(1) # specify the number of threads, returns the previous setting
print(prev) # on first use tries to grab all cores
start = time()
twoPi = 2*np.pi
b = ne.evaluate('sin(twoPi*a)**2 + cos(twoPi*a)**2')
end = time()
print("Time in seconds:", round(end-start,3))
print(a,b)
```
Average time: