Data processing


1.) Jupyter Notebooks (install, basic usage)

2.) FileIO and data manipulation (read, write, different data formats)

3.) Plotting (data visualization)

Installing Jupyter Notebook

In [595]:
using Pkg
Pkg.add("IJulia")
using IJulia
#notebook()
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`

The notebook interface

List of commands in the command palette

Ctrl + Shift + p or Command + Shift + p

Help

Navigation bar -> Help or press h if in Edit mode

Cells

  • Code cell, contains code to be executed
  • Markdown cell for text formatting
  • Use arrow keys to navigate up and down cells
  • Hit Return to enter into cell in coding mode or
  • press m and press Return to enter into markdown mode
  • Execute cell by pressing run in navigation bar or
  • press Ctrl + Return or Shift + Return
  • Press Esc to leave cell editing

Activating your enviroment

In [554]:
using Pkg
Pkg.activate(".")
  Activating environment at `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`

FileIO

File stream

Reading and writing of text files:

In [555]:
f = open("output.txt", "w")
Out[555]:
IOStream(<file output.txt>)
In [556]:
write(f, "This is a first line of text.\n")
Out[556]:
30
In [557]:
write(f, "This is another line of text.\n")
Out[557]:
30
In [558]:
close(f)

Many things can go wrong when reading or writing files, therfore it is good practise
to catch possible errors and to make sure the filestream is closed again:

In [559]:
f = open("output.txt", "r")
try
    l = readline(f)
    l = readline(f)
    write(f, "last line of text.\n")
catch exc
    println("Something went wrong: $exc")
finally
    close(f)
end
Something went wrong: ArgumentError("write failed, IOStream is not writeable")

Another way of opening a stream for writing:

In [560]:
open("other_output.txt", "w") do f
    write(f, "text for another file.")
end
Out[560]:
22

JSON

Structure for universal data exchange.

In [561]:
Pkg.add("JSON");
using JSON
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`

Read and write JSON data:

Create sample dictionary to hold the data

In [562]:
# dictionary to write
dict1 = Dict("key1" => 1, "key2" => 2,
            "key3" => Dict("sub1"=>"some text","sub2"=>5.4335,"sub3"=>[3,5,7,3]))
Out[562]:
Dict{String, Any} with 3 entries:
  "key2" => 2
  "key3" => Dict{String, Any}("sub2"=>5.4335, "sub3"=>[3, 5, 7, 3], "sub1"=>"so…
  "key1" => 1
In [563]:
# cast dict into json format
stringdata = JSON.json(dict1)
Out[563]:
"{\"key2\":2,\"key3\":{\"sub2\":5.4335,\"sub3\":[3,5,7,3],\"sub1\":\"some text\"},\"key1\":1}"

Writing text string to output stream:

In [564]:
f = open("out_json.json", "w")
try
    write(f, stringdata)
catch exc
    println("Something went wrong: $exc")
finally
    close(f)
end
Out[564]:
78

Reading back the JSON file:

In [565]:
jDict = Dict()
open("out_json.json", "r") do f
    global jDict
    jDict=JSON.parse(f)  # parse and transform data
end
Out[565]:
Dict{String, Any} with 3 entries:
  "key2" => 2
  "key3" => Dict{String, Any}("sub2"=>5.4335, "sub3"=>Any[3, 5, 7, 3], "sub1"=>…
  "key1" => 1

For other IO (Multimedia, Network) there is extensive documentation:
https://docs.julialang.org/en/v1/base/io-network/

Installing a few more Packages

In [566]:
Pkg.add("CSV");
Pkg.add("XLSX");
Pkg.add("DataFrames");
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
In [567]:
using CSV
using DataFrames
using XLSX

Loading Excel Data

In [568]:
xf = XLSX.readxlsx("demo_data.xlsx")
Out[568]:
XLSXFile("demo_data.xlsx") containing 1 Worksheet
            sheetname size          range        
-------------------------------------------------
               Sheet1 6x3           A1:C6        
In [569]:
XLSX.sheetnames(xf)
Out[569]:
1-element Vector{String}:
 "Sheet1"
In [570]:
sh = xf["Sheet1"]
typeof(sh)
Out[570]:
XLSX.Worksheet

Selecting all data from Sheet

In [571]:
sh[:]
Out[571]:
6×3 Matrix{Any}:
 "Name"                "Salary"  "Entry"
 "Jon Doo"        75000          2010-10-23
 "Rose Bush"      60000          2013-03-04
 "Anne Teak"      83000          2007-11-12
 "Rod Knee"       55000          2019-08-23
 "Hank R. Cheef"  98000          2001-04-29
In [572]:
# Selecting only 2 rows and 2 columns
sh["A1:B3"]
Out[572]:
3×2 Matrix{Any}:
 "Name"            "Salary"
 "Jon Doo"    75000
 "Rose Bush"  60000

Lets cast the Matrix into a DataFrame

In [573]:
# Leave the first row out (column names)
hr = sh["A2:C6"] |> DataFrame
Out[573]:

5 rows × 3 columns

x1x2x3
AnyAnyAny
1Jon Doo750002010-10-23
2Rose Bush600002013-03-04
3Anne Teak830002007-11-12
4Rod Knee550002019-08-23
5Hank R. Cheef980002001-04-29
In [574]:
# Manually rename the columns
rename!(hr, [:Name, :Salary, :Entry_Date])
Out[574]:

5 rows × 3 columns

NameSalaryEntry_Date
AnyAnyAny
1Jon Doo750002010-10-23
2Rose Bush600002013-03-04
3Anne Teak830002007-11-12
4Rod Knee550002019-08-23
5Hank R. Cheef980002001-04-29

We can combine all these steps into one:

In [575]:
hr = DataFrame(XLSX.readtable("demo_data.xlsx", "Sheet1")...)
Out[575]:

5 rows × 3 columns

NameSalaryEntry
AnyAnyAny
1Jon Doo750002010-10-23
2Rose Bush600002013-03-04
3Anne Teak830002007-11-12
4Rod Knee550002019-08-23
5Hank R. Cheef980002001-04-29
In [576]:
typeof(hr)
Out[576]:
DataFrame

Let's save the dataframe as CSV file:

In [577]:
CSV.write("HR_Sheet.csv", hr)
Out[577]:
"HR_Sheet.csv"

Read in CSV file:

In [578]:
hr_csv = CSV.File("HR_Sheet.csv")
Out[578]:
5-element CSV.File{false}:
 CSV.Row: (Name = "Jon Doo", Salary = 75000, Entry = Date("2010-10-23"))
 CSV.Row: (Name = "Rose Bush", Salary = 60000, Entry = Date("2013-03-04"))
 CSV.Row: (Name = "Anne Teak", Salary = 83000, Entry = Date("2007-11-12"))
 CSV.Row: (Name = "Rod Knee", Salary = 55000, Entry = Date("2019-08-23"))
 CSV.Row: (Name = "Hank R. Cheef", Salary = 98000, Entry = Date("2001-04-29"))
In [579]:
# Select the Name field of the first CSV row
hr_csv[1][:Name]
Out[579]:
"Jon Doo"

Use DataFrames instead of CSV.Row

In [580]:
hr_csv = CSV.File("HR_Sheet.csv", header=true) |> DataFrame
Out[580]:

5 rows × 3 columns

NameSalaryEntry
StringInt64Date
1Jon Doo750002010-10-23
2Rose Bush600002013-03-04
3Anne Teak830002007-11-12
4Rod Knee550002019-08-23
5Hank R. Cheef980002001-04-29

Now, let's read another Excel file:

In [581]:
hr_departments = DataFrame(XLSX.readtable("demo_data2.xlsx", "Sheet1")...)
Out[581]:

5 rows × 2 columns

NameDepartment
AnyAny
1Jon DooHR
2Rose BushR&D
3Anne TeakR&D
4Rod KneeMarketing
5Hank R. CheefGM

Casting into datatypes while reading:

In [511]:
hr_departments = DataFrame(XLSX.readtable("demo_data2.xlsx", "Sheet1", infer_eltypes=true)...)
Out[511]:

5 rows × 2 columns

NameDepartment
StringString
1Jon DooHR
2Rose BushR&D
3Anne TeakR&D
4Rod KneeMarketing
5Hank R. CheefGM

Let's join both tables into one:

In [512]:
hr_all = innerjoin(hr_csv, hr_departments, on=:Name)
Out[512]:

5 rows × 4 columns

NameSalaryEntryDepartment
StringInt64DateString
1Jon Doo750002010-10-23HR
2Rose Bush600002013-03-04R&D
3Anne Teak830002007-11-12R&D
4Rod Knee550002019-08-23Marketing
5Hank R. Cheef980002001-04-29GM

Selecting fields and sub-sets

Selecting a single column:

In [513]:
hr_all.Name
Out[513]:
5-element Vector{String}:
 "Jon Doo"
 "Rose Bush"
 "Anne Teak"
 "Rod Knee"
 "Hank R. Cheef"
In [514]:
hr_all[!, :Name]
Out[514]:
5-element Vector{String}:
 "Jon Doo"
 "Rose Bush"
 "Anne Teak"
 "Rod Knee"
 "Hank R. Cheef"

Sub-setting by condition:

In [515]:
# Our condition:
hr_all.Salary .> 75000
Out[515]:
5-element BitVector:
 0
 0
 1
 0
 1
In [516]:
hr_all[hr_all.Salary .> 75000, :]
Out[516]:

2 rows × 4 columns

NameSalaryEntryDepartment
StringInt64DateString
1Anne Teak830002007-11-12R&D
2Hank R. Cheef980002001-04-29GM

Sub-setting columns by name:

In [519]:
hr_all[!, [:Name, :Entry]]
Out[519]:

5 rows × 2 columns

NameEntry
StringDate
1Jon Doo2010-10-23
2Rose Bush2013-03-04
3Anne Teak2007-11-12
4Rod Knee2019-08-23
5Hank R. Cheef2001-04-29

Using the Query.jl package

In [520]:
Pkg.add("Query");
using Query;
using Statistics
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`

Pipe operator "|>" is used to feed the output of an operation into the input of the next function, e.g. select, filter and sub-set dataframes:
The command from above:

hr_all[hr_all.Salary .> 75000, :]

can be written as:

In [521]:
hr_all |> @filter(_.Salary > 75000)
Out[521]:
NameSalaryEntryDepartment
"Anne Teak"83000Date("2007-11-12")"R&D"
"Hank R. Cheef"98000Date("2001-04-29")"GM"

This way we can pipe easily multiple commands:

In [522]:
top = hr_all |> 
            @select(:Name, :Salary, :Department) |> 
            @filter(_.Salary > 75000) |>
            DataFrame
Out[522]:

2 rows × 3 columns

NameSalaryDepartment
StringInt64String
1Anne Teak83000R&D
2Hank R. Cheef98000GM
In [523]:
df2 = hr_all |>
            @groupby(_.Department) |>
            @map({Department=key(_), Salary=mean(_.Salary)}) |>
            DataFrame
Out[523]:

4 rows × 2 columns

DepartmentSalary
StringFloat64
1HR75000.0
2R&D71500.0
3Marketing55000.0
4GM98000.0

Detailed description of parameters for import and export can be found in the package documentation:

CSV.jl

https://csv.juliadata.org/stable/

XLSX.jl

https://felipenoris.github.io/XLSX.jl/stable/

For example in CSV.jl you can normalize the column names to remove unwanted characters
as whitespaces etc.

How to download data from url:

In [582]:
d=download("https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv")
Out[582]:
"/var/folders/p5/vnkq8n3d0s9_ftw6ktg1b4bcrqfp_x/T/jl_aOviFf"
In [583]:
weather = CSV.File(d) |> DataFrame
describe(weather)
Out[583]:

13 rows × 7 columns

variablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1date2014-07-012015-06-300Date
2actual_mean_temp61.04931863.0880Int64
3actual_min_temp49.9589752.0750Int64
4actual_max_temp71.63012673.01000Int64
5average_min_temp48.81922948.0680Int64
6average_max_temp70.98365072.0890Int64
7record_min_temp31.4658-530.0620Int64
8record_max_temp88.72886990.01040Int64
9record_min_temp_year1953.2818791963.020150Int64
10record_max_temp_year1953.9918791953.020150Int64
11actual_precipitation0.1024110.00.02.650Float64
12average_precipitation0.1140820.090.110.150Float64
13record_precipitation2.20890.851.986.880Float64

Iris Dataset

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica
In [584]:
Pkg.add("RDatasets");
using RDatasets;
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
In [585]:
# Loading the Iris dataset:
iris = dataset("datasets", "iris");
size(iris)
Out[585]:
(150, 5)

Explore the table by using first(<data>, nrows)

In [586]:
first(iris, 3)
Out[586]:

3 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
In [587]:
# Using sclicing
iris[1:3, :]
Out[587]:

3 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
In [588]:
describe(iris)
Out[588]:

5 rows × 7 columns

variablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1SepalLength5.843334.35.87.90Float64
2SepalWidth3.057332.03.04.40Float64
3PetalLength3.7581.04.356.90Float64
4PetalWidth1.199330.11.32.50Float64
5Speciessetosavirginica0CategoricalValue{String, UInt8}

Plotting data

using Plots.jl

In [589]:
Pkg.add("Plots")
using Plots
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`

Basic line plot:

In [590]:
plot(1:10, rand(10))
Out[590]:

we can add more data:

In [591]:
plot!(rand(10).+1)
Out[591]:
In [592]:
plot([sin,cos], 0:0.1:Ï€)
Out[592]:

Plots with annotations:

In [593]:
x = 1:10
y = rand(10, 2)

plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
Out[593]:

Selecting different the plotting backends

Plot.jl uses a default backend for plotting

  • GR (fast)
  • Plotly (interactive)
  • Pyplot
  • etc

Changing the plotting backend

Switch to interactive plotting:

In [537]:
plotly()
Out[537]:
Plots.PlotlyBackend()
In [538]:
plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
Out[538]:
Plots.jl

Adding layout and combining different plots:

In [539]:
gr()

p1 = plot(x, y, xlabel = "X Axis", title = "Lineplot") # Line plot
p2 = scatter(x, y) # Scatter plot
p3 = plot(x, y, lw = 3) # Another line plot
p4 = histogram(x, y) # Histogram
plot(p1, p2, p3, p4, layout = (2, 2), legend = false)
Out[539]:

Using the StatsPlots package

In [540]:
Pkg.add("StatsPlots")
using StatsPlots
   Resolving package versions...
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
  No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`

Moving back to the Iris dataset:

Scatter plot the petal length vs. petal width

In [541]:
@df iris scatter(:PetalLength, :PetalWidth)
Out[541]:

With StatsPlots we can do a correlation plot of selected parameters:

In [542]:
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth], grid = false)
Out[542]:

Histogram of sepal length grouped by species

In [543]:
@df iris groupedhist(:SepalLength, group = :Species, bar_position = :dodge)
Out[543]:

image.png

Scratching the surface in 45 minutes

All packages in Julia are very well documented

  • CSV.jl
  • XLSX.jl
  • JSON.jl
  • DataFrames.jl
  • Plots.jl
  • StatsPlots.jl