An Introduction to Ibis¶

Ibis aims to provide a single API for composing analytical queries from a variety of database 'backends' - e.g., PostgeSQL, PySpark, Pandas, etc. The API closely resembles the PySpark DataFrame API - i.e., the aim is to represent SQL expressions using Python code.

Imports and Configuration¶

In [1]:

Copied!

import ibis

ibis.options.interactive = True
import ibis

ibis.options.interactive = True

We set ibis.options.interactive = True to disable lazy evaluation, given that we're working in a notebook.

Connect to a Database¶

In [2]:

Copied!

connection = ibis.sqlite.connect("geography.sqlite")
connection.list_tables()
connection = ibis.sqlite.connect("geography.sqlite")
connection.list_tables()

Out[2]:

['countries', 'countries_gdp', 'gdp', 'independence']

Explore Tables¶

In [3]:

Copied!

countries = connection.table("countries")
countries
countries = connection.table("countries")
countries

Out[3]:

	iso_alpha2	iso_alpha3	iso_numeric	fips	name	capital	area_km2	population	continent
0	AD	AND	20	AN	Andorra	Andorra la Vella	468.0	84000	EU
1	AE	ARE	784	AE	United Arab Emirates	Abu Dhabi	82880.0	4975593	AS
2	AF	AFG	4	AF	Afghanistan	Kabul	647500.0	29121286	AS
3	AG	ATG	28	AC	Antigua and Barbuda	St. Johns	443.0	86754	NA
4	AI	AIA	660	AV	Anguilla	The Valley	102.0	13254	NA
...	...	...	...	...	...	...	...	...	...
247	YE	YEM	887	YM	Yemen	Sanaa	527970.0	23495361	AS
248	YT	MYT	175	MF	Mayotte	Mamoudzou	374.0	159042	AF
249	ZA	ZAF	710	SF	South Africa	Pretoria	1219912.0	49000000	AF
250	ZM	ZMB	894	ZA	Zambia	Lusaka	752614.0	13460305	AF
251	ZW	ZWE	716	ZI	Zimbabwe	Harare	390580.0	13061000	AF

252 rows × 9 columns

The table can be manipulated as if it were a dataframe.

In [4]:

Copied!

print(f"# rows = {countries.count()}")
print(f"# columns = {len(countries.columns)}")
print(f"# rows = {countries.count()}")
print(f"# columns = {len(countries.columns)}")

# rows = 252
# columns = 9

In [5]:

Copied!





asian_countries = (
    countries['name', 'continent', 'population']
    .filter(countries['continent'] == 'AS')
)
asian_countries.sort_by(ibis.desc("population")).limit(10)
asian_countries = (
    countries['name', 'continent', 'population']
    .filter(countries['continent'] == 'AS')
)
asian_countries.sort_by(ibis.desc("population")).limit(10)

Out[5]:

	name	continent	population
0	China	AS	1330044000
1	India	AS	1173108018
2	Indonesia	AS	242968342
3	Pakistan	AS	184404791
4	Bangladesh	AS	156118464
5	Japan	AS	127288000
6	Philippines	AS	99900177
7	Vietnam	AS	89571130
8	Turkey	AS	77804122
9	Iran	AS	76923300

Aggregations¶

In [6]:

Copied!





total_poplation = (
    countries
    .group_by("continent")
    .aggregate(
        countries["population"].sum().name("total_population")
    )
)
total_poplation
total_poplation = (
    countries
    .group_by("continent")
    .aggregate(
        countries["population"].sum().name("total_population")
    )
)
total_poplation

Out[6]:

	continent	total_population
0	AF	1021238685
1	AN	170
2	AS	4130584841
3	EU	750724554
4	NA	540204371
5	OC	36067549
6	SA	400143568

Joining Data¶

Start by loading another table.

In [7]:

Copied!

gdp = connection.table("gdp")
gdp
gdp = connection.table("gdp")
gdp

Out[7]:

	country_code	year	value
0	ABW	1986	4.054634e+08
1	ABW	1987	4.876025e+08
2	ABW	1988	5.964236e+08
3	ABW	1989	6.953044e+08
4	ABW	1990	7.648871e+08
...	...	...	...
9995	SVK	2002	3.513034e+10
9996	SVK	2003	4.681659e+10
9997	SVK	2004	5.733202e+10
9998	SVK	2005	6.278531e+10
9999	SVK	2006	7.070810e+10

10000 rows × 3 columns

Let's filter-out all by the most recent year's data.

In [8]:

Copied!





gdp_latest = (
    gdp
    .group_by("country_code")
    .order_by(ibis.desc("year"))
    .mutate(
        latest_year=gdp["year"].first(),
        latest_value=gdp["value"].first()
    )
    .select(["country_code", "latest_year", "latest_value"])
    .relabel({"latest_year": "year", "latest_value": "gdp"})
    .distinct()
)
gdp_latest
gdp_latest = (
    gdp
    .group_by("country_code")
    .order_by(ibis.desc("year"))
    .mutate(
        latest_year=gdp["year"].first(),
        latest_value=gdp["value"].first()
    )
    .select(["country_code", "latest_year", "latest_value"])
    .relabel({"latest_year": "year", "latest_value": "gdp"})
    .distinct()
)
gdp_latest

Out[8]:

	country_code	year	gdp
0	ABW	2017	2.700559e+09
1	AFG	2017	2.019176e+10
2	AGO	2017	1.221238e+11
3	ALB	2017	1.302506e+10
4	AND	2017	3.013387e+09
...	...	...	...
253	XKX	2017	7.227700e+09
254	YEM	2017	2.681870e+10
255	ZAF	2017	3.495541e+11
256	ZMB	2017	2.586814e+10
257	ZWE	2017	2.281301e+10

258 rows × 3 columns

Note, the above query is identicle to one with explicit window functions,

w = ibis.window(group_by=gdp["country_code"], order_by=ibis.desc(gdp["year"]))

latest_gdp = (
    gdp
    .mutate(
        latest_value=gdp["value"].first().over(w),
        latest_year=gdp["year"].first().over(w)
    )
    .select(["country_code", "latest_year", "latest_value"])
    .relabel({"latest_year": "year", "latest_value": "gdp"})
    .distinct()
)

Join with coutries table.

In [9]:

Copied!





countries_gdp = (
    countries
    .inner_join(gdp_latest, countries["iso_alpha3"] == gdp_latest["country_code"])
    .select(["country_code", "name", "continent", "population", "gdp"])
)

countries_gdp
countries_gdp = (
    countries
    .inner_join(gdp_latest, countries["iso_alpha3"] == gdp_latest["country_code"])
    .select(["country_code", "name", "continent", "population", "gdp"])
)

countries_gdp

Out[9]:

	country_code	name	continent	population	gdp
0	AND	Andorra	EU	84000	3.013387e+09
1	ARE	United Arab Emirates	AS	4975593	3.777011e+11
2	AFG	Afghanistan	AS	29121286	2.019176e+10
3	ATG	Antigua and Barbuda	NA	86754	1.467978e+09
4	ALB	Albania	EU	2986952	1.302506e+10
...	...	...	...	...	...
206	XKX	Kosovo	EU	1800000	7.227700e+09
207	YEM	Yemen	AS	23495361	2.681870e+10
208	ZAF	South Africa	AF	49000000	3.495541e+11
209	ZMB	Zambia	AF	13460305	2.586814e+10
210	ZWE	Zimbabwe	AF	13061000	2.281301e+10

211 rows × 5 columns

Reading and Writing Data¶

We can also create and tables (and likewise views) via Ibis.

In [10]:

Copied!





if "countries_gdp" in connection.list_tables():
    connection.drop_table("countries_gdp")
connection.create_table("countries_gdp", countries_gdp)
connection.list_tables()
if "countries_gdp" in connection.list_tables():
    connection.drop_table("countries_gdp")
connection.create_table("countries_gdp", countries_gdp)
connection.list_tables()

Out[10]:

['countries', 'countries_gdp', 'gdp', 'independence']

Working with Temporary Views and SQL¶

It also possible to express queries using SQL. This isn't implemented for the SQLite backend, but otherwise would look like,

continent_gdp = (
    countries_gdp
    .alias("countries_gdp")
    .sql("SELECT continent, sum(gdp) FROM countries_gdp GROUPBY continent")
)

Where the alias command creates a temporary view within the database.

Lazy Evaluation and Converting to Pandas DataFrame¶

If we make ibis.options.interactive = False, then Ibis will only return interim table metadata untill we call the execute method, much like Apache Spark.

The execute method can also be used with eager evaluation to return a dataframe.

In [11]:

Copied!

ibis.options.interactive = False

df = countries.execute()

print(f"type(df) = {type(df)}")
print("df:")
df
ibis.options.interactive = False

df = countries.execute()

print(f"type(df) = {type(df)}")
print("df:")
df

type(df) = <class 'pandas.core.frame.DataFrame'>
df:

Out[11]:

	iso_alpha2	iso_alpha3	iso_numeric	fips	name	capital	area_km2	population	continent
0	AD	AND	20	AN	Andorra	Andorra la Vella	468.0	84000	EU
1	AE	ARE	784	AE	United Arab Emirates	Abu Dhabi	82880.0	4975593	AS
2	AF	AFG	4	AF	Afghanistan	Kabul	647500.0	29121286	AS
3	AG	ATG	28	AC	Antigua and Barbuda	St. Johns	443.0	86754	NA
4	AI	AIA	660	AV	Anguilla	The Valley	102.0	13254	NA
...	...	...	...	...	...	...	...	...	...
247	YE	YEM	887	YM	Yemen	Sanaa	527970.0	23495361	AS
248	YT	MYT	175	MF	Mayotte	Mamoudzou	374.0	159042	AF
249	ZA	ZAF	710	SF	South Africa	Pretoria	1219912.0	49000000	AF
250	ZM	ZMB	894	ZA	Zambia	Lusaka	752614.0	13460305	AF
251	ZW	ZWE	716	ZI	Zimbabwe	Harare	390580.0	13061000	AF

252 rows × 9 columns