The Analytic Garden: I Really Want to Like Julia...

I really do. I want to like Julia. There is much to like: a great REPL; multiple dispatch; concurrent, parallel, and distributed computing; direct calling of C and Fortran libraries; a dynamic type system; a nice package manager; macros; and more.

There are problems with Julia that may be showstoppers for me. My usual workflow with Python, R, Java, or C++ is to write code in small pieces and incrementally test, building the program one routine at a time. For example, I typically write the input routine, test; write the data processing steps, one step at a time and test; write plotting routines and test. Test the whole program and fix any problems. I really should write the tests first like I tell students, but sometimes I cheat.

The process described above is common, but Julia makes programming that way frustrating. The source of the frustration is compile time latency. Julia's JIT compiler produces great execution speed, but you pay a price each time you start Julia or load a package. I understand why this happens: Julia compiles specialized code for the argument types it sees the first time a method is called, which is what makes multiple dispatch fast. Multiple dispatch is key to how Julia works, so although compile times may improve in the future, startup will probably always be slower than in a language like Python, which doesn't do as much type analysis up front.

How Bad Is It?

Recently, I was trying to use Julia for a small project. The details don't matter. The program flow was to read a small CSV file, do some analysis using two of the data columns, and produce four or five plots. I wasn't familiar with Julia's plotting routines, so I decided to test plotting before proceeding with the analysis. I started in my usual way. I fired up VS Code and wrote some code to read the CSV file and use two of the data columns to make a scatter plot.

Here's some Julia code. The data is temperature anomaly data from the NASA Global Climate Change site.

using DataFrames
using CSV
using PyCall
@pyimport matplotlib
@pyimport matplotlib.pyplot as plt

function plot_temperature(df)
    size = 3

    f1 = plt.figure()
    ax1 = f1.gca()
    ax1.plot(df.Year, df.Temperature_Anomaly, "o", markersize = size)
    plt.xlabel("Year")
    plt.ylabel("Temperature Anomaly")
    plt.title("Temperature Anomaly")

    plt.show(block = false)
end

function main()
    input_file = "../data/HadCRUT.4.6.0.0.annual_ns_avg.csv"

    df = DataFrame(CSV.File(input_file))
    plot_temperature(df)
end

main()

If you run this, the plot flashes on the screen so briefly that you can't see it. That's deliberate, because I wanted to test how long it took to bring up a plot.

$ time julia test_plot.jl
┌ Warning: `vendor()` is deprecated, use `BLAS.get_config()` and inspect the output instead
│   caller = npyinitialize() at numpy.jl:67
└ @ PyCall ~/.julia/packages/PyCall/L0fLP/src/numpy.jl:67

real    0m25.150s
user    0m24.921s
sys     0m1.319s

Here's what the output should look like.

[Figure: scatter plot of temperature anomaly versus year.]

This example uses Julia 1.7. Twenty-five seconds seems long for reading a 14 KB file and showing a plot. The warning comes from PyCall's NumPy initialization code (numpy.jl), which calls the deprecated BLAS function vendor(). Apparently, Julia loads BLAS whether you want it or not.

If I try to run this code from the REPL, loading the data file with DataFrame takes 18.9 seconds the first time. If I load the data a second time, it's instantaneous, 0.007288 seconds. That's an example of compile time latency. It's a killer the first time, but really zooms when you run the same code again.
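The pay-once pattern is easy to see with a small timing helper. What follows is only a toy sketch in Python, not a model of how Julia's compiler works: the names _prepare, load, and time_calls are hypothetical, and the 0.2 second sleep is an assumed stand-in for compilation cost.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def _prepare(kind):
    """Hypothetical one-time setup standing in for JIT compilation."""
    time.sleep(0.2)  # simulated compile cost, paid once per element type
    return sum       # the "compiled" routine for this type


def load(values):
    """Toy workload: the first call for each element type pays the setup cost."""
    routine = _prepare(type(values[0]).__name__)
    return routine(values)


def time_calls(fn, arg, repeats=2):
    """Return the elapsed seconds of each of `repeats` successive calls."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        elapsed.append(time.perf_counter() - start)
    return elapsed


if __name__ == "__main__":
    first, second = time_calls(load, [1, 2, 3])
    print(f"first call:  {first:.4f} s")
    print(f"second call: {second:.4f} s")
```

The first call carries the whole setup cost and the second reuses the cached result, which is the same shape as the 18.9 second versus 0.007288 second measurement in the REPL.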

If you search the various Julia forums for answers to slow compilation, the typical response is to work in the REPL. The problem is that I want to produce stand-alone code. It's hard to convince a non-fan of Julia that they need to fire up the REPL and things will be better the second time they run the program.

For contrast, here's some Python code that does the same task. This is Python 3.9.7.

import pandas as pd
import matplotlib.pyplot as plt

def plot_temperature(df):
    size = 3

    f1 = plt.figure()
    ax1 = f1.gca()
    ax1.plot(df["Year"], df["Temperature_Anomaly"], "o", markersize = size)
    plt.xlabel("Year")
    plt.ylabel("Temperature Anomaly")
    plt.title("Temperature Anomaly")

    plt.show(block = False)


def main():
    input_file = "../data/HadCRUT.4.6.0.0.annual_ns_avg.csv"

    df = pd.read_csv(input_file)
    plot_temperature(df)


if __name__ == "__main__":
    main()

$ time python test_plot.py

real    0m0.671s
user    0m0.691s
sys     0m1.132s

Python clocks in at 0.67 seconds. That's more like the time I would expect.

Just for good measure, let's see what happens with R. I usually run R from RStudio. For a fair comparison, I ran an R script from the command line.

test_plot <- function() {
  library(ggplot2)
  
  df <- read.csv("../data/HadCRUT.4.6.0.0.annual_ns_avg.csv")
  p <- ggplot(df) + 
    geom_point(aes(x = Year, y = Temperature_Anomaly)) +
    labs(x = 'Year', y = 'Temperature Anomaly') +
    ggtitle('Temperature Anomaly')
  x11()
  plot(p)
}

test_plot()

$ time Rscript test_plot.R

real    0m0.715s
user    0m0.518s
sys     0m0.050s

R 4.1.2 runs the code in a total time similar to Python.
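A single run under time is a noisy measurement, so it can help to repeat each command a few times and look at the spread. Here is a minimal sketch of such a harness in Python. The function names time_command and summarize are hypothetical, and the stand-in command is there only so the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run `argv` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed silently
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the fastest, median, and slowest run in seconds."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in command (an assumption for illustration); replace it with e.g.
    # ["julia", "test_plot.jl"], ["python", "test_plot.py"], or ["Rscript", "test_plot.R"].
    print(summarize(time_command([sys.executable, "-c", "pass"])))
```

This measures wall-clock time only, which corresponds to the real line in the output of time. It includes interpreter startup, which is exactly the cost being compared here.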

What Else?

You might wonder why I used PyCall and matplotlib rather than Plots.jl. The reason is that Plots wouldn't work. Simply adding using Plots to the script and plotting produces no output from the program. Remember, I wanted multiple plot windows. Running from the REPL, the first plot was overwritten by the second. Using the gr() backend to Plots.jl displays a plot, but if a second plot is drawn, the first is overwritten.

For example,

using Plots

gr()

f = plot(1:10, rand(10), reuse = false)
f2 = plot(11:20, rand(10), reuse = false)

display(f)
display(f2)

println("Press <Enter> to exit...")
readline()

only produces one figure, the second. The reuse argument is ignored.

Using pyplot() instead of gr() produces no output from the program.

This version actually works the way I expect, albeit slowly. It requires gnuplot and produces warnings.

using Plots;

gaston()

f = plot(1:10, rand(10), reuse = false)
f2 = plot(11:20, rand(10), reuse = false)

display(f)
display(f2)

println("Press <Enter> to exit...")
readline()

The problems of not being able to plot properly from programs and of producing multiple plot windows have been around since 2018. It seems like such a basic issue that I would have thought it would have been addressed by now.

Julia has great potential, but the Julia team must address some of these problems before Julia is fully ready for primetime.

Contributors

wfmu