FactSet DuckDB
Run duckdb -ui to launch DuckDB UI in your default browser. You can run SQL queries there.
Click “Attached databases” to add a new database connection.
Fill in the connection details:
- Path: path to your DuckDB database file (usually ends with
.duckdb) - Alias (Optional): a nickname for the database connection, e.g.
FactSet
Alternatively, you can also launch the UI directly with a specific database connection using the following command:
duckdb "/Users/menghan/Desktop/FactSet Database/factset_single.duckdb" -ui
This will load the specified database in the UI automatically.
Pros: Nice UI for table overview, column browsing, and query execution.
Cons: Cannot save query notebooks to local. → Use Jupyter Notebook / Python for better organization and version control.
You can run SQL queries in the Notebooks. To quit the UI, go to terminal and hit Ctrl + D.
When you have built a connection in Python, it can be convenient if you open a web UI to view the database schema and run some quick queries. However, if you have an active connection to the database in Python, the web UI will be locked in read-only mode to prevent conflicts. This means you can only open the web UI in read-only mode.
Use the following command to open the UI in read-only mode:
duckdb "/Users/menghan/Desktop/FactSet Database/factset_single.duckdb" -readonly -ui
ref:
-
DuckDB SQL is a variant of PostgreSQL.
Keyboard shortcuts
| Action | Shortcut |
|---|---|
| Run Query | ⌘ Enter |
| Format Query | ⌘ ⇧ F |
| Clear Query Results | ⌘ ⇧ K |
Build a Python notebook to connect to DuckDB and run SQL queries. This way, you keep your SQL queries and data analysis in one place.
The notebooks in the web UI persiste btw sessions in the same browser on the same machine. But it’s better to use local Jupyter notebooks for better organization and version control.
Here is a Google Colab notebook example connecting to FactSet DuckDB database.
Jupyter Notebook
Create a new Jupyter notebook. In VS Code, go to Command Palette (⇧⌘P) → “Create: New Jupyter Notebook”.
See HERE for basic usage of Jupyter notebooks in VS Code.
Don’t use OneDrive to store you
.ipynbfiles, it sync prevents notebooks from saving its kernel state properly.OneDrive is rubbish. MS apps should always be avoided if possible…
Set up environment by installing necessary packages:
%conda uninstall --quiet --yes malloy
%conda install --quiet --yes --upgrade duckdb
%conda install --quiet --yes jupysql==0.11.1
%conda install --quiet --yes duckdb-engine
%conda install --quiet --yes pandas
%conda install --quiet --yes matplotlib
It will install duckdb, jupysql, duckdb-engine, pandas, matplotlib packages. It takes a while to install all packages.
jupysql allows you to run SQL and plot large datasets in Jupyter via a %sql, %%sql, and %sqlplotmagics.
Build the connection
# Connect to file backed db in read-only mode to avoid lock conflicts
conn = duckdb.connect("~/Desktop/FactSet Database/factset_single.duckdb", read_only=True)
# use the DuckDB connection
%sql conn
Querying DuckDB
Single line SQL queries can be run using %sql at the start of a line.
%sql SELECT 'Off and flying!' AS a_duckdb_column;
Save result to a variable
%sql res << SELECT 'Off and flying!' AS a_duckdb_column;
An entire Jupyter cell can be used as a SQL cell by starting with %%sql
%%sql
-- look up price based on tickers
SELECT price.*, sym.ticker_region, cov.proper_name, p.one_day_pct as DailyReturns
FROM fp_v2.fp_total_returns_daily AS p
LEFT JOIN fp_v2.fp_basic_prices AS price
ON price.fsym_id = p.fsym_id AND p.p_date = price.p_date
LEFT JOIN sym_v1.sym_ticker_region AS sym
ON p.fsym_id = sym.fsym_id
JOIN sym_v1.sym_coverage AS cov
ON sym.fsym_id = cov.fsym_id
WHERE sym.ticker_region IN ('AAPL-US','IBM-US','SPY-US','MSFT-US','GOOGL-US') -- change tickers
AND p.p_date BETWEEN '2021-08-16' AND '2021-08-20' -- change date scope
ORDER BY p.p_date, sym.ticker_region
Query results will be displayed as a Pandas DataFrame.
One liner vs. Cell magic
# Based on ticker-region, find fsym_id
fsym_ids = %sql SELECT cov.*, tr.ticker_region FROM sym_v1.sym_coverage AS cov LEFT JOIN sym_v1.sym_ticker_region AS tr ON tr.fsym_id = cov.fsym_regional_id WHERE tr.ticker_region IN ()
# Based on ticker-region, find fsym_id
%%sql fysm_ids <<
SELECT cov.*, tr.ticker_region
FROM sym_v1.sym_coverage AS cov
LEFT JOIN sym_v1.sym_ticker_region AS tr
ON tr.fsym_id = cov.fsym_regional_id
WHERE tr.ticker_region IN ()
SQL Syntax
First 10 rows from a table:
SELECT *
FROM client.rus3000_prc_data
LIMIT 10