Being an analytics/reporting product company, weâ€™ve worked with a fair amount of customers who run Postgres/Redshift for their analytics, and the same fair amount who uses MySQL.
And from our experience with both, Iâ€™d prefer our new customers to use Postgres all the time. For reporting/analytics/data-warehouse purpose, Postgres wins over MySQL hands-down.
In this post I lay down a few practical reasons why working with Postgres is so much better than MySQL from a data analyst perspective.
1- MySQL doesnâ€™t supportÂ CTE.
Unlike production application where your queries should be as simple and straight-forward as possible, queries for analytics are usually quite complex. And very often you need to build intermediate resultset to be used multiple times across your queries.
The query below in Postgres will get the employed people together with their supervisor.
# Postgres with employees as ( select id, first_name || ' ' || last_name as full_name, supervisor_id from people where person_type = 'employed' ) select E1.id, E1.full_name, E2.id as supervisor_id E2.full_name as supervisor_full_name from employees E1 left join employees E2 ON E1.supervisor_id = E2.id
With MySQL, the query will need to be more verbose and difficult to maintain, because you have to repeat employees into subquery. See the difference.
2- MySQL doesnâ€™t support window functions
I canâ€™t stress enough how useful this is when it comes to doing analytics. Window functions bring Postgresâ€™ capabilities to a whole new level, allowing a lot of complex, sophisticated requests to be done. There have been countless times that I need to use window functions to do a certain report.
3- MySQL doesnâ€™t support schema (namespacing)
Your analytics database donâ€™t usually contain just data from one source. You want to pull multiple data sources into one place. Things like click-stream events (in Hadoop), 3rd-party data sources like Adwords or Google Analytics, multiple production databases (when your apps are doing service-oriented architecture).
When these happen, namespacing your data from different data sources become a must-do. And MySQL doesnâ€™t have an elegant way to do it! Iâ€™ve seen people doing it the prefix way
soa1__users soa2__bookings adwords__clicks_by_country
Itâ€™s like having one folder with over 100 files, and no subfolders. Itâ€™s difficult to manage, itâ€™s overwhelming, and thereâ€™s no native way to work with all tables from the same data source.
Postgres saves you all these pains with schemas, so that you can put data in different data sources into one dedicated schema.
Update: Edwin pointed out in the comment that my MySQL I can do cross-database joins, thus getting the same functionality of Postgresâ€™ schemas. While this is posible, it doesnâ€™t feel right to be to maintain different databases just for the sake of categorizing them.
4- Postgres has better interface when dealing with datetimes
Dealing with dates/datetimes is a major topic when doing analysis. With my experience so far, Postgres has a far more consistent interface over MySQL.
Take a simple example of converting a timestamp field to day, week (first day of week), and month (first date of month).
- DATE_TRUNC(ts, â€˜DAYâ€™)
- DATE_TRUNC(ts, â€˜WEEKâ€™)
- DATE_TRUNC(ts, â€˜MONTHâ€™)
- DATE_ADD(ts, INTERVAL (1-DAYOFWEEK(ts) DAY))
5- Other smallerÂ things:
I try to make this post sort, so below are some smaller things in Postgres is better over MySQL, that I find immensely useful.
MySQL doesnâ€™t support full outer join: I know this might not be a big deal, but it is a big deal when you need it and it isnâ€™t there. Learn more why here.
Wrong group by doesnâ€™t throw error in MySQL: This query will give an error in Postgres, but not in MySQL (itâ€™ll give strange results). Iâ€™d rather get an error thrown at me, than getting a wrong result.
# error in Postgres, and give wrong results in MySQL! SELECT listing_type, count(1) FROM listings
Postgres supports VALUES for manual values list: In Postgres to generate a manual table, I can use VALUES, with MySQL, I have to use UNION ALL
# Postgres values ('2015-01-01', 100, 200), ('2015-01-02', 200, 400), ('2015-01-03', 300, 600) # MySQL select '2015-01-01', 100, 200 union all select '2015-01-02', 200, 400 union all select 2015-01-03', 300, 600
Postgres has generate_series This has helped me countless time from generating number series, sample data, and date ranges!
Postgres has materialized views, to help you with pre-calculating data (complex queries) so that querying them will be quicker later on.
MySQL is indeed the most popular database, and is a fine and battle-tested one. But over the past few years, Postgres has taken over and become my favourite database, both for building production applications, and doing data analytics. I do hope MySQL will catch up with the missing features, so that itâ€™ll make some (if not a lot) of our lives easier.
The above list is nowhere near exhaustive, so feel free to share in the comments if you have anything to add.