SQL migration files and goose annotations

In this post we'll explore SQL migration files and +goose annotation comments, which are used to parse SQL statements and optionally modify how migrations are executed.

As of this writing there are five annotations:

-- +goose Up
-- +goose Down
-- +goose StatementBegin
-- +goose StatementEnd
-- +goose NO TRANSACTION

bonus

In addition to SQL migration files, the goose package can be used to write Go-based migrations and track both SQL and Go migrations in the same way.

See repository example for Go migrations. We'll do a deep dive in a future post. Stay tuned!

Quick start

Here's a copy/pasteable .sql migration file to get started:

-- +goose Up
SELECT 'up SQL query';

-- +goose Down
SELECT 'down SQL query';

Remember, annotations are captured as comments and cannot have leading spaces:

-- +goose Up ✅
    -- +goose Up ❌ (error because leading whitespace)

The basics

A SQL migration file must have a .sql extension and is prefixed with either a timestamp or a sequential number.

There is a handy goose create command to stub out migration files in a consistent way:

timestamp

$ goose -dir migrations create add_users_table sql
Created new file: migrations/20230201093158_add_users_table.sql

sequential

$ goose -dir migrations -s create add_users_table sql # (1)! Created new file:
migrations/00001_add_users_table.sql

The -s flag instructs goose to create new migration files in sequential order. Timestamp is the default.

A SQL migration file can have both Up and Down migrations. For the curious, there is an open issue ( #374) requesting support for migrations to be split in separate files.

Each SQL migration file is expected to have exactly one -- +goose Up annotation.

The -- +goose Down annotation is optional, but recommended, and must come after the Up annotation within the file. Example:

-- +goose Up
SELECT 'up SQL query 1';
SELECT 'up SQL query 2';
SELECT 'up SQL query 3';

-- +goose Down (1)
SELECT 'down SQL query 1';
SELECT 'down SQL query 2';

The -- +goose Down annotation is optional, and may be omitted entirely if there are no down migrations. Within the .sql file it must come after the -- +goose Up annotation.

This is invalid:
```
-- +goose Down
SELECT 'down SQL query';

-- +goose Up
SELECT 'up SQL query';
```

Any statements following -- +goose Up will be executed as part of an up migration, and any statements following -- +goose Down will be executed as part of a down migration.

Complex statements

By default, SQL statements are delimited by semicolons - in fact, query statements must end with a semicolon to be properly recognized by goose.

More complex statements (PL/pgSQL) that have semicolons within them must be annotated with -- +goose StatementBegin and -- +goose StatementEnd to be properly parsed. For example:

-- +goose Up

-- +goose StatementBegin
CREATE OR REPLACE FUNCTION histories_partition_creation( DATE, DATE )
returns void AS $$
DECLARE
  create_query text;
BEGIN
-- This comment will be preserved.
  -- And so will this one.
  FOR create_query IN SELECT
      'CREATE TABLE IF NOT EXISTS histories_'
      || TO_CHAR( d, 'YYYY_MM' )
      || ' ( CHECK( created_at >= timestamp '''
      || TO_CHAR( d, 'YYYY-MM-DD 00:00:00' )
      || ''' AND created_at < timestamp '''
      || TO_CHAR( d + INTERVAL '1 month', 'YYYY-MM-DD 00:00:00' )
      || ''' ) ) inherits ( histories );'
    FROM generate_series( $1, $2, '1 month' ) AS d
  LOOP
    EXECUTE create_query;
  END LOOP;  -- LOOP END
END;         -- FUNCTION END
$$
language plpgsql;
-- +goose StatementEnd

When goose detects a -- +goose StatementBegin annotation it continues parsing statement(s), ignoring semicolons, until -- +goose StatementEnd is detected. The resulting statement is stripped of leading and trailing comments / empty lines.

Comments and empty lines within the statement are preserved!

Multiple statements

But that's not all, the Begin and End annotations can be used to combine multiple statements so they get sent as a single command instead of being sent one-by-one.

This is best illustrated with a contrived example. Suppose we have a migration that creates a users table and adds 100,000 rows with distinct INSERT's.

-- +goose Up
CREATE TABLE users (
    id int NOT NULL PRIMARY KEY,
    username text,
    name text,
    surname text
);

-- (1)! Inserts:
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (1, 'gallant_almeida7', 'Gallant', 'Almeida7');
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (2, 'brave_spence8', 'Brave', 'Spence8');
.
.
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (99999, 'jovial_chaum1', 'Jovial', 'Chaum1');
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (100000, 'goofy_ptolemy0', 'Goofy', 'Ptolemy0');

-- +goose Down
DROP TABLE users;

This is a contrived example. Normally this would be a set of batched INSERT's with multiple column values, each enclosed with parentheses and separated by commas, like so:

INSERT INTO "users" ("id", "username", "name", "surname")
VALUES
  (1, 'gallant_almeida7', 'Gallant', 'Almeida7'),
  (2, 'brave_spence8', 'Brave', 'Spence8');

The Up migration contains 100,001 unique statements, all executed within the same transaction, but sent to the database one-by-one. This migration will take ~38s to complete due to the number of round trips.

Using PostgreSQL as an example, here's what the database logs show:

LOG:  statement: INSERT INTO "users" ("id", "username", "name", "surname") VALUES (1, 'gallant_almeida7', 'Gallant', 'Almeida7');
LOG:  statement: INSERT INTO "users" ("id", "username", "name", "surname") VALUES (2, 'brave_spence8', 'Brave', 'Spence8');
LOG:  statement: INSERT INTO "users" ("id", "username", "name", "surname") VALUES (3, 'lucid_bardeen6', 'Lucid', 'Bardeen6');
[...] 100,000 log statements

However, if we want to combine the inserts into a single command, we can wrap them with -- +goose StatementBegin and -- +goose StatementEnd annotations. Note, a single command still contains several statements separated by semicolons, but they get sent to the server in the same query string like so: "INSERT INTO ...; INSERT INTO ...;".

-- +goose Up
CREATE TABLE users (
    id int NOT NULL PRIMARY KEY,
    username text,
    name text,
    surname text
);

-- +goose StatementBegin
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (1, 'gallant_almeida7', 'Gallant', 'Almeida7');
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (2, 'brave_spence8', 'Brave', 'Spence8');
.
.
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (99999, 'jovial_chaum1', 'Jovial', 'Chaum1');
INSERT INTO "users" ("id", "username", "name", "surname") VALUES (100000, 'goofy_ptolemy0', 'Goofy', 'Ptolemy0');
-- +goose StatementEnd

-- +goose Down
DROP TABLE users;

These annotations instruct goose to send a single command, which now consists of multiples statements delimited by semicolons, in one shot.

Yes, that's a larger payload, but that's fine and the migration will execute in ~3s, which is an order of magnitude faster as compared to the previous example that ran in ~38s.

Migrations outside transaction

All statements within a migration file are run within a transaction. Some statements, like CREATE DATABASE or CREATE INDEX CONCURRENTLY, cannot be run within a transaction block.

For such cases add the -- +goose NO TRANSACTION annotation, usually placed at the top of the file.

This annotation instructs goose to run all statements within the file without transactions. This applies to all Up and Down statements within the file.

-- +goose NO TRANSACTION

-- +goose Up
CREATE INDEX CONCURRENTLY ON users (user_id);

-- +goose Down
DROP INDEX IF EXISTS users_user_id_idx;