Hello, reader! If you work with big files and S3, you may find my s3patch utility useful. Or, hire me to help you with your data problems.

This blog post explains the motivation for the creation of a SQLite virtual table extension for Parquet files. Go to cldellow/sqlite-parquet-vtable if you just want the code.

I was playing around with a project to visualize data from the 2016 Canada census. The raw data from Stats Canada is a 1291 MB CSV¹. The CSV has 12,000,000 rows, each capturing metrics for men and women in a census area. Over 5,000 census areas are covered, with 2,200 dimensions in each. I wanted to find out what percentage of people travelled by bicycle in three particular cities. My end goal is to make a standalone web site that lets you explore the data. Ideally, it’d be compact and could run on a dirt-cheap shared web host.

For my purposes, a 1291 MB CSV was too cumbersome. In addition to being large, grepping for a single record is slow. My first stop was Postgres: load the CSV into a strings table, in which every column is text, and query it directly. The key pieces of the query:

    CASE WHEN profile_id = '1930' THEN 'total' ELSE 'cyclist' END AS mode,
    ...
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
    ...
    (100.0 * cyclist.female::int / total.female::int)::numeric(4,2) AS pct_female
    ...
    WHERE total.mode = 'total' AND cyclist.mode = 'cyclist'

The result columns:

    geo_name | male | female | pct_male | pct_female
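Only fragments of that query survive, so here is a minimal sketch of how the pieces could fit together. The CTE, the self-join, and the cyclist profile_id ('1935') are illustrative assumptions, not the post’s original code:

    -- Sketch: pull each city's 'total' and 'cyclist' rows, then
    -- self-join to compute cycling percentages.
    -- '1935' is a hypothetical profile_id for the bicycle dimension.
    WITH modes AS (
      SELECT
        geo_name,
        CASE WHEN profile_id = '1930' THEN 'total' ELSE 'cyclist' END AS mode,
        male,
        female
      FROM strings
      WHERE profile_id IN ('1930', '1935')
        AND geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
    )
    SELECT
      total.geo_name,
      cyclist.male,
      cyclist.female,
      (100.0 * cyclist.male::int / total.male::int)::numeric(4,2) AS pct_male,
      (100.0 * cyclist.female::int / total.female::int)::numeric(4,2) AS pct_female
    FROM modes total
    JOIN modes cyclist ON cyclist.geo_name = total.geo_name
    WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';

Each city contributes one 'total' row and one 'cyclist' row, so the self-join pairs them up and the projection matches the result columns above.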
No surprise: my rural hometown of Dawson Creek has almost no cyclists, the university town I live in has some, and the hippie enclave of Victoria has double-digit percentages.

Not bad! The query is a bit slow, though – almost a second. If we add an index with CREATE INDEX ON strings(geo_name), the query takes only 10ms – but the index adds an extra 400 MB!

Strongly typed

The strings table stores every value as text. As such, it repeats a lot of the same values. For example, the string Dawson Creek appears 2,247 times. Total - Lone-parent census families in private households - 100% data appears 5,469 times. As you can imagine, that’s pretty wasteful. A better approach would be to normalize the data so that, rather than storing thousands of copies of the same value, we store one value and just point to it. Usually, you do this by creating separate lookup tables for each set of values. That’s a lot of work for a blog post, so I’m going to compromise by using Postgres’s ENUM types. I’ll also start storing numeric values as numeric values, rather than strings. I’m lazy, so we’ll bootstrap the enums and strongly-typed table from the existing strings table:

    SELECT format('CREATE TYPE geo_name AS ENUM (%s)',
                  string_agg(DISTINCT quote_literal(geo_name), ', '))
    ...
    SELECT format('CREATE TYPE profile AS ENUM (%s)',
                  string_agg(DISTINCT quote_literal(md5(profile)), ', '))
    ...
    (CASE WHEN gnr = '' THEN NULL ELSE gnr END)::real AS gnr,
    (CASE WHEN gnr_lf = '' THEN NULL ELSE gnr_lf END)::real AS gnr_lf,
    (CASE WHEN notes = '' THEN NULL ELSE notes END)::smallint AS notes,
    (CASE WHEN total IN ('..', '...', 'x', 'F') THEN NULL ELSE total END)::real AS total,
    (CASE WHEN male IN ('..', '...', 'x', 'F') THEN NULL ELSE male END)::real AS male,
    (CASE WHEN female IN ('..', '...', 'x', 'F') THEN NULL ELSE female END)::real AS female
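The surviving lines only hint at the full bootstrap, so here is a minimal sketch of how it might be stitched together in psql. The typed table name, the exact column list, and the use of \gexec to execute the generated statements are assumptions:

    -- Generate each enum definition from the distinct values observed,
    -- then execute the generated statement with psql's \gexec.
    SELECT format('CREATE TYPE geo_name AS ENUM (%s)',
                  string_agg(DISTINCT quote_literal(geo_name), ', '))
    FROM strings
    \gexec

    SELECT format('CREATE TYPE profile AS ENUM (%s)',
                  string_agg(DISTINCT quote_literal(md5(profile)), ', '))
    FROM strings
    \gexec

    -- Build the strongly-typed table ("typed" is a hypothetical name);
    -- Stats Canada's missing-value codes ('..', '...', 'x', 'F') become NULLs.
    CREATE TABLE typed AS
    SELECT
      geo_name::geo_name AS geo_name,
      md5(profile)::profile AS profile,
      (CASE WHEN gnr = '' THEN NULL ELSE gnr END)::real AS gnr,
      (CASE WHEN gnr_lf = '' THEN NULL ELSE gnr_lf END)::real AS gnr_lf,
      (CASE WHEN notes = '' THEN NULL ELSE notes END)::smallint AS notes,
      (CASE WHEN total IN ('..', '...', 'x', 'F') THEN NULL ELSE total END)::real AS total,
      (CASE WHEN male IN ('..', '...', 'x', 'F') THEN NULL ELSE male END)::real AS male,
      (CASE WHEN female IN ('..', '...', 'x', 'F') THEN NULL ELSE female END)::real AS female
    FROM strings;

Hashing profile through md5 is presumably about Postgres’s 63-byte limit on enum labels: the census dimension names run far longer than that, while an md5 hex digest always fits.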
\d+ shows that the newly-typed table is only 1147 MB (vs the untyped table’s 1438 MB). This is an improvement, but still a bit of a disappointment. Two things are hurting us: Postgres has a relatively high minimum per-row cost of about 40 bytes² (across 12,000,000 rows, that’s roughly 480 MB of overhead before any actual data), and representing a small number like 280 as a number rather than a string is actually more storage intensive (in exchange, query-time arithmetic will be faster).

Now our query on the unindexed-but-typed table takes only 550 ms (vs the unindexed-and-untyped table’s 720 ms). This is because enumerated types are much cheaper to compare than strings. Even better, creating an index on geo_name only takes 263 MB (vs 400 MB).

Postgres was very fast to slurp the data in and fast to query once we set up indexes, but it still resulted in a 1419 MB database. Can we do better?

CSV virtual table module

SQLite supports exposing a CSV file in-place as a table via the CSV virtual table extension. Virtual tables are a lot like Postgres’s foreign data wrappers. I don’t expect this will provide satisfactory query performance, but it’s so easy to try that we may as well:
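What follows is a minimal sketch of that experiment; the extension path and file name are placeholders, and the filename/header arguments follow SQLite’s csv extension (ext/misc/csv.c):

    -- Load the CSV virtual table extension (placeholder path; requires
    -- a SQLite with runtime extension loading enabled).
    SELECT load_extension('./csv');

    -- Expose the census CSV in place: there is no import step, so every
    -- query scans the underlying file.
    CREATE VIRTUAL TABLE census USING csv(
      filename='census.csv',
      header=true
    );

    SELECT count(*) FROM census WHERE geo_name = 'Dawson Creek';

With header=true the module takes column names from the CSV’s first row, so geo_name here assumes the header spells it that way; in the sqlite3 shell, .load ./csv is the usual alternative for loading the extension.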