using Pkg
Pkg.add("IJulia")
using IJulia
#notebook()
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
Ctrl + Shift + p or Command + Shift + p
Navigation bar -> Help or press h if in Edit mode
using Pkg
Pkg.activate(".")
Activating environment at `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml`
f = open("output.txt", "w")
IOStream(<file output.txt>)
write(f, "This is a first line of text.\n")
30
write(f, "This is another line of text.\n")
30
close(f)
Many things can go wrong when reading or writing files, therfore it is good practise
to catch possible errors and to make sure the filestream is closed again:
f = open("output.txt", "r")
try
l = readline(f)
l = readline(f)
write(f, "last line of text.\n")
catch exc
println("Something went wrong: $exc")
finally
close(f)
end
Something went wrong: ArgumentError("write failed, IOStream is not writeable")
Another way of opening a stream for writing:
open("other_output.txt", "w") do f
write(f, "text for another file.")
end
22
Structure for universal data exchange.
Pkg.add("JSON");
using JSON
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
Create sample dictionary to hold the data
# dictionary to write
dict1 = Dict("key1" => 1, "key2" => 2,
"key3" => Dict("sub1"=>"some text","sub2"=>5.4335,"sub3"=>[3,5,7,3]))
Dict{String, Any} with 3 entries: "key2" => 2 "key3" => Dict{String, Any}("sub2"=>5.4335, "sub3"=>[3, 5, 7, 3], "sub1"=>"so… "key1" => 1
# cast dict into json format
stringdata = JSON.json(dict1)
"{\"key2\":2,\"key3\":{\"sub2\":5.4335,\"sub3\":[3,5,7,3],\"sub1\":\"some text\"},\"key1\":1}"
f = open("out_json.json", "w")
try
write(f, stringdata)
catch exc
println("Something went wrong: $exc")
finally
close(f)
end
78
jDict = Dict()
open("out_json.json", "r") do f
global jDict
jDict=JSON.parse(f) # parse and transform data
end
Dict{String, Any} with 3 entries: "key2" => 2 "key3" => Dict{String, Any}("sub2"=>5.4335, "sub3"=>Any[3, 5, 7, 3], "sub1"=>… "key1" => 1
For other IO (Multimedia, Network) there is extensive documentation:
https://docs.julialang.org/en/v1/base/io-network/
Pkg.add("CSV");
Pkg.add("XLSX");
Pkg.add("DataFrames");
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml` Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml` Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
using CSV
using DataFrames
using XLSX
xf = XLSX.readxlsx("demo_data.xlsx")
XLSXFile("demo_data.xlsx") containing 1 Worksheet sheetname size range ------------------------------------------------- Sheet1 6x3 A1:C6
XLSX.sheetnames(xf)
1-element Vector{String}: "Sheet1"
sh = xf["Sheet1"]
typeof(sh)
XLSX.Worksheet
Selecting all data from Sheet
sh[:]
6×3 Matrix{Any}: "Name" "Salary" "Entry" "Jon Doo" 75000 2010-10-23 "Rose Bush" 60000 2013-03-04 "Anne Teak" 83000 2007-11-12 "Rod Knee" 55000 2019-08-23 "Hank R. Cheef" 98000 2001-04-29
# Selecting only 2 rows and 2 columns
sh["A1:B3"]
3×2 Matrix{Any}: "Name" "Salary" "Jon Doo" 75000 "Rose Bush" 60000
Lets cast the Matrix into a DataFrame
# Leave the first row out (column names)
hr = sh["A2:C6"] |> DataFrame
x1 | x2 | x3 | |
---|---|---|---|
Any | Any | Any | |
1 | Jon Doo | 75000 | 2010-10-23 |
2 | Rose Bush | 60000 | 2013-03-04 |
3 | Anne Teak | 83000 | 2007-11-12 |
4 | Rod Knee | 55000 | 2019-08-23 |
5 | Hank R. Cheef | 98000 | 2001-04-29 |
# Manually rename the columns
rename!(hr, [:Name, :Salary, :Entry_Date])
Name | Salary | Entry_Date | |
---|---|---|---|
Any | Any | Any | |
1 | Jon Doo | 75000 | 2010-10-23 |
2 | Rose Bush | 60000 | 2013-03-04 |
3 | Anne Teak | 83000 | 2007-11-12 |
4 | Rod Knee | 55000 | 2019-08-23 |
5 | Hank R. Cheef | 98000 | 2001-04-29 |
We can combine all these steps into one:
hr = DataFrame(XLSX.readtable("demo_data.xlsx", "Sheet1")...)
Name | Salary | Entry | |
---|---|---|---|
Any | Any | Any | |
1 | Jon Doo | 75000 | 2010-10-23 |
2 | Rose Bush | 60000 | 2013-03-04 |
3 | Anne Teak | 83000 | 2007-11-12 |
4 | Rod Knee | 55000 | 2019-08-23 |
5 | Hank R. Cheef | 98000 | 2001-04-29 |
typeof(hr)
DataFrame
CSV.write("HR_Sheet.csv", hr)
"HR_Sheet.csv"
hr_csv = CSV.File("HR_Sheet.csv")
5-element CSV.File{false}: CSV.Row: (Name = "Jon Doo", Salary = 75000, Entry = Date("2010-10-23")) CSV.Row: (Name = "Rose Bush", Salary = 60000, Entry = Date("2013-03-04")) CSV.Row: (Name = "Anne Teak", Salary = 83000, Entry = Date("2007-11-12")) CSV.Row: (Name = "Rod Knee", Salary = 55000, Entry = Date("2019-08-23")) CSV.Row: (Name = "Hank R. Cheef", Salary = 98000, Entry = Date("2001-04-29"))
# Select the Name field of the first CSV row
hr_csv[1][:Name]
"Jon Doo"
hr_csv = CSV.File("HR_Sheet.csv", header=true) |> DataFrame
Name | Salary | Entry | |
---|---|---|---|
String | Int64 | Date | |
1 | Jon Doo | 75000 | 2010-10-23 |
2 | Rose Bush | 60000 | 2013-03-04 |
3 | Anne Teak | 83000 | 2007-11-12 |
4 | Rod Knee | 55000 | 2019-08-23 |
5 | Hank R. Cheef | 98000 | 2001-04-29 |
hr_departments = DataFrame(XLSX.readtable("demo_data2.xlsx", "Sheet1")...)
Name | Department | |
---|---|---|
Any | Any | |
1 | Jon Doo | HR |
2 | Rose Bush | R&D |
3 | Anne Teak | R&D |
4 | Rod Knee | Marketing |
5 | Hank R. Cheef | GM |
hr_departments = DataFrame(XLSX.readtable("demo_data2.xlsx", "Sheet1", infer_eltypes=true)...)
Name | Department | |
---|---|---|
String | String | |
1 | Jon Doo | HR |
2 | Rose Bush | R&D |
3 | Anne Teak | R&D |
4 | Rod Knee | Marketing |
5 | Hank R. Cheef | GM |
Let's join both tables into one:
hr_all = innerjoin(hr_csv, hr_departments, on=:Name)
Name | Salary | Entry | Department | |
---|---|---|---|---|
String | Int64 | Date | String | |
1 | Jon Doo | 75000 | 2010-10-23 | HR |
2 | Rose Bush | 60000 | 2013-03-04 | R&D |
3 | Anne Teak | 83000 | 2007-11-12 | R&D |
4 | Rod Knee | 55000 | 2019-08-23 | Marketing |
5 | Hank R. Cheef | 98000 | 2001-04-29 | GM |
Selecting a single column:
hr_all.Name
5-element Vector{String}: "Jon Doo" "Rose Bush" "Anne Teak" "Rod Knee" "Hank R. Cheef"
hr_all[!, :Name]
5-element Vector{String}: "Jon Doo" "Rose Bush" "Anne Teak" "Rod Knee" "Hank R. Cheef"
# Our condition:
hr_all.Salary .> 75000
5-element BitVector: 0 0 1 0 1
hr_all[hr_all.Salary .> 75000, :]
Name | Salary | Entry | Department | |
---|---|---|---|---|
String | Int64 | Date | String | |
1 | Anne Teak | 83000 | 2007-11-12 | R&D |
2 | Hank R. Cheef | 98000 | 2001-04-29 | GM |
hr_all[!, [:Name, :Entry]]
Name | Entry | |
---|---|---|
String | Date | |
1 | Jon Doo | 2010-10-23 |
2 | Rose Bush | 2013-03-04 |
3 | Anne Teak | 2007-11-12 |
4 | Rod Knee | 2019-08-23 |
5 | Hank R. Cheef | 2001-04-29 |
Pkg.add("Query");
using Query;
using Statistics
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
Pipe operator "|>" is used to feed the output of an operation into the input of the next function, e.g. select, filter and sub-set dataframes:
The command from above:
hr_all[hr_all.Salary .> 75000, :]
can be written as:
hr_all |> @filter(_.Salary > 75000)
Name | Salary | Entry | Department |
---|---|---|---|
"Anne Teak" | 83000 | Date("2007-11-12") | "R&D" |
"Hank R. Cheef" | 98000 | Date("2001-04-29") | "GM" |
This way we can pipe easily multiple commands:
top = hr_all |>
@select(:Name, :Salary, :Department) |>
@filter(_.Salary > 75000) |>
DataFrame
Name | Salary | Department | |
---|---|---|---|
String | Int64 | String | |
1 | Anne Teak | 83000 | R&D |
2 | Hank R. Cheef | 98000 | GM |
df2 = hr_all |>
@groupby(_.Department) |>
@map({Department=key(_), Salary=mean(_.Salary)}) |>
DataFrame
Department | Salary | |
---|---|---|
String | Float64 | |
1 | HR | 75000.0 |
2 | R&D | 71500.0 |
3 | Marketing | 55000.0 |
4 | GM | 98000.0 |
Detailed description of parameters for import and export can be found in the package documentation:
https://csv.juliadata.org/stable/
https://felipenoris.github.io/XLSX.jl/stable/
For example in CSV.jl you can normalize the column names to remove unwanted characters
as whitespaces etc.
d=download("https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv")
"/var/folders/p5/vnkq8n3d0s9_ftw6ktg1b4bcrqfp_x/T/jl_aOviFf"
weather = CSV.File(d) |> DataFrame
describe(weather)
variable | mean | min | median | max | nmissing | eltype | |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | date | 2014-07-01 | 2015-06-30 | 0 | Date | ||
2 | actual_mean_temp | 61.0493 | 18 | 63.0 | 88 | 0 | Int64 |
3 | actual_min_temp | 49.9589 | 7 | 52.0 | 75 | 0 | Int64 |
4 | actual_max_temp | 71.6301 | 26 | 73.0 | 100 | 0 | Int64 |
5 | average_min_temp | 48.8192 | 29 | 48.0 | 68 | 0 | Int64 |
6 | average_max_temp | 70.9836 | 50 | 72.0 | 89 | 0 | Int64 |
7 | record_min_temp | 31.4658 | -5 | 30.0 | 62 | 0 | Int64 |
8 | record_max_temp | 88.7288 | 69 | 90.0 | 104 | 0 | Int64 |
9 | record_min_temp_year | 1953.28 | 1879 | 1963.0 | 2015 | 0 | Int64 |
10 | record_max_temp_year | 1953.99 | 1879 | 1953.0 | 2015 | 0 | Int64 |
11 | actual_precipitation | 0.102411 | 0.0 | 0.0 | 2.65 | 0 | Float64 |
12 | average_precipitation | 0.114082 | 0.09 | 0.11 | 0.15 | 0 | Float64 |
13 | record_precipitation | 2.2089 | 0.85 | 1.98 | 6.88 | 0 | Float64 |
Pkg.add("RDatasets");
using RDatasets;
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
# Loading the Iris dataset:
iris = dataset("datasets", "iris");
size(iris)
(150, 5)
Explore the table by using first(<data>, nrows)
first(iris, 3)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Cat… | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
# Using sclicing
iris[1:3, :]
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Cat… | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
describe(iris)
variable | mean | min | median | max | nmissing | eltype | |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
5 | Species | setosa | virginica | 0 | CategoricalValue{String, UInt8} |
using Plots.jl
Pkg.add("Plots")
using Plots
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
plot(1:10, rand(10))
plot!(rand(10).+1)
plot([sin,cos], 0:0.1:Ï€)
x = 1:10
y = rand(10, 2)
plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
Plot.jl uses a default backend for plotting
Switch to interactive plotting:
plotly()
Plots.PlotlyBackend()
plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
gr()
p1 = plot(x, y, xlabel = "X Axis", title = "Lineplot") # Line plot
p2 = scatter(x, y) # Scatter plot
p3 = plot(x, y, lw = 3) # Another line plot
p4 = histogram(x, y) # Histogram
plot(p1, p2, p3, p4, layout = (2, 2), legend = false)
Pkg.add("StatsPlots")
using StatsPlots
Resolving package versions... No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Project.toml` No Changes to `~/work/Julia/tutorials/Elixir_training_2021-04-22/Manifest.toml`
Scatter plot the petal length vs. petal width
@df iris scatter(:PetalLength, :PetalWidth)
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth], grid = false)
@df iris groupedhist(:SepalLength, group = :Species, bar_position = :dodge)
All packages in Julia are very well documented