An Introduction to Ibis¶
Ibis aims to provide a single API for composing analytical queries from a variety of database 'backends' - e.g., PostgeSQL, PySpark, Pandas, etc. The API closely resembles the PySpark DataFrame API - i.e., the aim is to represent SQL expressions using Python code.
Imports and Configuration¶
import ibis
ibis.options.interactive = True
We set ibis.options.interactive = True
to disable lazy evaluation, given that we're working in a notebook.
Connect to a Database¶
connection = ibis.sqlite.connect("geography.sqlite")
connection.list_tables()
['countries', 'countries_gdp', 'gdp', 'independence']
Explore Tables¶
countries = connection.table("countries")
countries
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | |
---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 4975593 | AS |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 29121286 | AS |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. Johns | 443.0 | 86754 | NA |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
247 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 23495361 | AS |
248 | YT | MYT | 175 | MF | Mayotte | Mamoudzou | 374.0 | 159042 | AF |
249 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 49000000 | AF |
250 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 13460305 | AF |
251 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF |
252 rows × 9 columns
The table can be manipulated as if it were a dataframe.
print(f"# rows = {countries.count()}")
print(f"# columns = {len(countries.columns)}")
# rows = 252 # columns = 9
asian_countries = (
countries['name', 'continent', 'population']
.filter(countries['continent'] == 'AS')
)
asian_countries.sort_by(ibis.desc("population")).limit(10)
name | continent | population | |
---|---|---|---|
0 | China | AS | 1330044000 |
1 | India | AS | 1173108018 |
2 | Indonesia | AS | 242968342 |
3 | Pakistan | AS | 184404791 |
4 | Bangladesh | AS | 156118464 |
5 | Japan | AS | 127288000 |
6 | Philippines | AS | 99900177 |
7 | Vietnam | AS | 89571130 |
8 | Turkey | AS | 77804122 |
9 | Iran | AS | 76923300 |
Aggregations¶
total_poplation = (
countries
.group_by("continent")
.aggregate(
countries["population"].sum().name("total_population")
)
)
total_poplation
continent | total_population | |
---|---|---|
0 | AF | 1021238685 |
1 | AN | 170 |
2 | AS | 4130584841 |
3 | EU | 750724554 |
4 | NA | 540204371 |
5 | OC | 36067549 |
6 | SA | 400143568 |
Joining Data¶
Start by loading another table.
gdp = connection.table("gdp")
gdp
country_code | year | value | |
---|---|---|---|
0 | ABW | 1986 | 4.054634e+08 |
1 | ABW | 1987 | 4.876025e+08 |
2 | ABW | 1988 | 5.964236e+08 |
3 | ABW | 1989 | 6.953044e+08 |
4 | ABW | 1990 | 7.648871e+08 |
... | ... | ... | ... |
9995 | SVK | 2002 | 3.513034e+10 |
9996 | SVK | 2003 | 4.681659e+10 |
9997 | SVK | 2004 | 5.733202e+10 |
9998 | SVK | 2005 | 6.278531e+10 |
9999 | SVK | 2006 | 7.070810e+10 |
10000 rows × 3 columns
Let's filter-out all by the most recent year's data.
gdp_latest = (
gdp
.group_by("country_code")
.order_by(ibis.desc("year"))
.mutate(
latest_year=gdp["year"].first(),
latest_value=gdp["value"].first()
)
.select(["country_code", "latest_year", "latest_value"])
.relabel({"latest_year": "year", "latest_value": "gdp"})
.distinct()
)
gdp_latest
country_code | year | gdp | |
---|---|---|---|
0 | ABW | 2017 | 2.700559e+09 |
1 | AFG | 2017 | 2.019176e+10 |
2 | AGO | 2017 | 1.221238e+11 |
3 | ALB | 2017 | 1.302506e+10 |
4 | AND | 2017 | 3.013387e+09 |
... | ... | ... | ... |
253 | XKX | 2017 | 7.227700e+09 |
254 | YEM | 2017 | 2.681870e+10 |
255 | ZAF | 2017 | 3.495541e+11 |
256 | ZMB | 2017 | 2.586814e+10 |
257 | ZWE | 2017 | 2.281301e+10 |
258 rows × 3 columns
Note, the above query is identicle to one with explicit window functions,
w = ibis.window(group_by=gdp["country_code"], order_by=ibis.desc(gdp["year"]))
latest_gdp = (
gdp
.mutate(
latest_value=gdp["value"].first().over(w),
latest_year=gdp["year"].first().over(w)
)
.select(["country_code", "latest_year", "latest_value"])
.relabel({"latest_year": "year", "latest_value": "gdp"})
.distinct()
)
Join with coutries table.
countries_gdp = (
countries
.inner_join(gdp_latest, countries["iso_alpha3"] == gdp_latest["country_code"])
.select(["country_code", "name", "continent", "population", "gdp"])
)
countries_gdp
country_code | name | continent | population | gdp | |
---|---|---|---|---|---|
0 | AND | Andorra | EU | 84000 | 3.013387e+09 |
1 | ARE | United Arab Emirates | AS | 4975593 | 3.777011e+11 |
2 | AFG | Afghanistan | AS | 29121286 | 2.019176e+10 |
3 | ATG | Antigua and Barbuda | NA | 86754 | 1.467978e+09 |
4 | ALB | Albania | EU | 2986952 | 1.302506e+10 |
... | ... | ... | ... | ... | ... |
206 | XKX | Kosovo | EU | 1800000 | 7.227700e+09 |
207 | YEM | Yemen | AS | 23495361 | 2.681870e+10 |
208 | ZAF | South Africa | AF | 49000000 | 3.495541e+11 |
209 | ZMB | Zambia | AF | 13460305 | 2.586814e+10 |
210 | ZWE | Zimbabwe | AF | 13061000 | 2.281301e+10 |
211 rows × 5 columns
Reading and Writing Data¶
We can also create and tables (and likewise views) via Ibis.
if "countries_gdp" in connection.list_tables():
connection.drop_table("countries_gdp")
connection.create_table("countries_gdp", countries_gdp)
connection.list_tables()
['countries', 'countries_gdp', 'gdp', 'independence']
Working with Temporary Views and SQL¶
It also possible to express queries using SQL. This isn't implemented for the SQLite backend, but otherwise would look like,
continent_gdp = (
countries_gdp
.alias("countries_gdp")
.sql("SELECT continent, sum(gdp) FROM countries_gdp GROUPBY continent")
)
Where the alias
command creates a temporary view within the database.
Lazy Evaluation and Converting to Pandas DataFrame¶
If we make ibis.options.interactive = False
, then Ibis will only return interim table metadata untill we call the execute
method, much like Apache Spark.
The execute
method can also be used with eager evaluation to return a dataframe.
ibis.options.interactive = False
df = countries.execute()
print(f"type(df) = {type(df)}")
print("df:")
df
type(df) = <class 'pandas.core.frame.DataFrame'> df:
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | |
---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 4975593 | AS |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 29121286 | AS |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. Johns | 443.0 | 86754 | NA |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
247 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 23495361 | AS |
248 | YT | MYT | 175 | MF | Mayotte | Mamoudzou | 374.0 | 159042 | AF |
249 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 49000000 | AF |
250 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 13460305 | AF |
251 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF |
252 rows × 9 columns