Hacker News

Managing Databases in a CI/CD pipeline is frankly difficult.

Applications are developed continuously; different versions of the application need to run in test, dev and staging environments, each of which has different data and a different schema. This makes for a difficult set of challenges in setting up each environment and reliably promoting software from one environment to the next. Staging environments need copies of production data in order to test things properly. They also need all the new data and schema changes required to make the new version of the software work.

So now you’re moving data and schema changes back and forth, up and down the pipeline. But unlike with application source code, you do not have Git to clone, branch and merge your changes. And applications keep becoming more data-centric: designers create design systems with all kinds of amazing reactive components for devs to use, each of these components has fields, each of these fields has data, that data needs to be stored in a database, and that database needs to be updated every time new components are added, modified, or used in new ways.

We should manage the revision of data the way we manage the revision of our source code. We should merge our latest changes into the production database. We should simply clone our production database for staging; we should just push changes from a developer’s local machine to a test environment!



Sometimes you can't use production data in your staging environment, for example when privacy laws prevent it. Often, developers who wouldn't have access to production data do have permission to view data in a staging environment...

Creating tools to produce realistic test data and also using fuzzing tools can be just as good and sometimes even better than using actual production data.
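
A minimal sketch of what such a generator can look like, using only Python's standard library (the field names and ranges here are made up for illustration; real setups often reach for a library like Faker, or property-based fuzzing tools):

```python
import random
import string

def fake_user(rng: random.Random) -> dict:
    """Generate one synthetic user row; schema is hypothetical."""
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "name": name.capitalize(),
        "email": f"{name}@example.com",  # never a real address
        "age": rng.randint(18, 90),
    }

rng = random.Random(42)  # fixed seed so the test data is reproducible
users = [fake_user(rng) for _ in range(3)]
```

Seeding the generator matters more than it looks: reproducible test data means a failing test can be re-run with the exact same rows.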


I'm having a hard time visualizing this - are we talking outside devs (consultants) or folks in a different jurisdiction who don't have rights under law, or just a company administrative policy? Agree that fake data can work well, but if you can clone the production DB, that seems a preferable and easier approach.


There's a variety of reasons this may happen that stem from either legal, certification, or company reasons.

Your production database may contain medically sensitive PII (or, for something like SOC 2 compliance, any PII at all) that cannot be shared with any human other than the original user without prior approval.

Even for non-externally mandated reasons, companies may (and often do) wish to restrict access to production data by developers to minimize concerns around data exfiltration and snooping on user data by company employees.


That makes sense. I suppose, using my model, you could mask some of the data in a versioned graph or a collection within the database, surface that to other users, and let them clone the collection that excludes PII. You could run the main collection and the PII-free collection in the same data product. This might be an easier approach than creating fake data & fake schema.


> This might be an easier approach than creating fake data & fake schema.

I doubt this, for two reasons. The first is that the development database usually shares its schema with the production database, so that part isn't an issue.

The second is that fake data neatly sidesteps the various issues around de-anonymization via correlations among bits of data that have ostensibly had their PII-sensitive parts removed.

If protection of user data is a priority, there are far fewer headaches associated with creating entirely fake data to populate the same schema than trying to figure out post-hoc censoring of production data.

That's not to say there aren't valid use cases of the latter. You often will want to do post-hoc censoring/aggregation if you wish to track e.g. usage metrics. This is in fact often a component of ETLs. However, those are removed from everyday development tasks.


I am struggling to think of any time it would be appropriate to clone a production database and use it for staging. That almost inevitably translates into any developer who ever deploys database-related changes having access to the contents of the production database, or at least every part of it that might ever be affected by code changes they are making. There are numerous legal, regulatory and ethical issues with granting such broad access, unless you're small enough that the few people you obviously do need to have the ability to grant full DB access happen to include anyone who is likely to change code that talks to the database. And even then it still seems like bad practice to actually grant anyone that access when it's not strictly necessary!


Legally. I've worked in healthcare and any direct access to production is strictly forbidden. If you can't figure it out through your logs and monitoring, and you can't reproduce in earlier environments, then the problem isn't that you don't have prod access, the problem is that your instrumentation has a gap.

Personally, I think we rely on prod access as a crutch because it's easier to expect that than it is to build a sufficient infrastructure. Cloning a prod database or allowing ad-hoc r/w access is on my list of strictly forbidden operations.


> Personally, I think we rely on prod access as a crutch because it's easier to expect that than it is to build a sufficient infrastructure.

Which I think is reasonable if you are not in a strictly controlled space.

Not everyone is able to spend weeks or months on instrumentation efforts.


For example, we as a company have decided on some rules around PII (personally identifiable information) that mean, amongst other things, that we don’t have any of it outside of production. We therefore cannot just copy data from production to staging/dev.


Alternatively, my company tends to rely on “testing in production” to avoid having different environments. If we need to try something new, we use feature gates/flags to only give select users access to experimental code paths. It’s worked well for us, but it does take some getting used to
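
A rough sketch of how such a gate can work: deterministic percentage bucketing, so the same user always lands in the same bucket (all names here are hypothetical; real systems usually use a feature-flag service):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a 0-100 percentage rollout.

    Hashing flag and user together keeps buckets independent across flags,
    so users in the 10% for one experiment aren't the same 10% for another.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Because the bucketing is a pure function of (flag, user), a user's experience stays stable as the rollout percentage ratchets up.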


> We should manage the revision of data the way we manage the revision of our source code

I believe PlanetScale is trying to do this, introducing branches to the database: https://docs.planetscale.com/concepts/nonblocking-schema-cha...


Many frameworks and libraries (ActiveRecord, Flyway, etc.) let you write migration scripts that are automatically applied during deployment. In that sense, you can version changes to your DB. Of course, you don't have a strict guarantee that your prod DB matches these scripts, but as long as you don't change the schema from outside such scripts and don't manually roll back or delete them, it works out well in practice. At least, I've never had an issue with that specific part of migrations (there are other issues you can run into with migrations, of course).
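
A toy version of what such tools do under the hood (SQLite in memory, with hypothetical migration scripts inlined; Flyway and ActiveRecord read real files and add checksums, locking, etc. on top of the same idea):

```python
import sqlite3

# Hypothetical migration scripts, keyed by version; real tools read these
# from files like V1__create_users.sql instead.
MIGRATIONS = {
    1: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE users ADD COLUMN email TEXT",
}

def migrate(conn: sqlite3.Connection) -> list:
    """Apply any migrations not yet recorded; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    applied = []
    for version in sorted(set(MIGRATIONS) - done):
        conn.execute(MIGRATIONS[version])
        conn.execute("INSERT INTO schema_migrations VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied

conn = sqlite3.connect(":memory:")
first = migrate(conn)   # applies every pending migration, in order
second = migrate(conn)  # idempotent: nothing left to apply
```

The `schema_migrations` table is the whole trick: the database records which scripts it has already seen, which is what makes re-running the deploy safe.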


This fails as soon as you have multiple instances of your app running (for load-balancing or HA reasons). Server one might still be running version A while server two runs the migrations that come with version B, leading A to crash or corrupt data. If shutting down for the length of the update is no longer an option, you will need to use the process the author pointed out.


That's not really related to the topic of DB versioning though, I feel. What you said is true, however it only matters if the DB operations are backwards-incompatible (something which you can avoid by using the approaches from the article).
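
For illustration, here is the usual backwards-compatible "expand" step for a column rename, sketched with SQLite (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, cust TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'alice')")

# Expand phase: add the new column and backfill it, but keep the old one,
# so server A (still reading `cust`) and server B (reading `customer_name`)
# can run side by side during the rollout.
conn.execute("ALTER TABLE orders ADD COLUMN customer_name TEXT")
conn.execute("UPDATE orders SET customer_name = cust")
conn.commit()

old_read = conn.execute("SELECT cust FROM orders WHERE id = 1").fetchone()[0]
new_read = conn.execute("SELECT customer_name FROM orders WHERE id = 1").fetchone()[0]
# Contract phase (dropping `cust`) only ships after every server reads the new column.
```

The point is that no single deploy ever removes something a still-running version depends on.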


> We should manage the revision of data the way we manage the revision of our source code.

I think some new features of e.g. Postgres allow for something like this, but all SQL databases share the same basic design flaw: no inherent version control. SQL databases are so big and complex that they haven't evolved much in 30 years. That's why there are so many "new" databases that are either a wrapper around Postgres or an attempt to re-create parts of a relational database from scratch.

It would be great if the next iteration of the SQL standard included version control semantics.


It's really not that difficult if you do the necessary legwork beforehand. That basically means, do not access the tables directly. Databases offer a read abstraction via views, and a write abstraction via stored procedures. Use those in your app, and suddenly you can atomically change your database schema without your app even having to go offline.

And as long as you take care of backwards compatibility in your abstraction layer (i.e. sensible defaults for new input columns, hide output columns in the view), you can run both the old and the new app side-by-side.
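
SQLite has views but not stored procedures, so this sketch only shows the read side of that idea: the app queries a view, and the underlying table can be swapped out in one transaction without touching the app's SQL (all names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_v1 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users_v1 VALUES (1, 'alice');
    -- The app only ever queries the view, never the table.
    CREATE VIEW users AS SELECT id, name FROM users_v1;
""")

def app_read(conn):
    """Stand-in for the application's read path."""
    return conn.execute("SELECT id, name FROM users").fetchall()

before = app_read(conn)

# Schema change: move to a split first/last-name table, then repoint the
# view atomically. The app's query text never changes.
conn.executescript("""
    BEGIN;
    CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    INSERT INTO users_v2 SELECT id, name, '' FROM users_v1;
    DROP VIEW users;
    CREATE VIEW users AS SELECT id, first AS name FROM users_v2;
    COMMIT;
""")

after = app_read(conn)
```

In Postgres or SQL Server, the write side would get the same treatment through stored procedures with stable signatures.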


That requires the developers to be very familiar with databases, which they usually are not. And I've never heard a single developer in my entire career say "Hey let's talk to a DBA before we design and implement this feature".


> Hey let's talk to a DBA before we design and implement this feature

It’s because we don’t like half of our application logic to be in random stored procedures that may or may not be there, and may or may not be the same across environments.


I use .NET Core, and often make changes to the database through the application; the application updates the database in both staging and production. This creates a migration history and allows me to revert to a previous database state. However, in some instances data is still lost; in that scenario I would just restore a backup.

This way I can almost painlessly revert back, and restore data.


I've not looked into it a ton, but would something like Pachyderm address this?


Yes - Pachyderm is good for managing and running distributed data pipelines on flat files.

There are also versioned databases like TerminusDB and Dolt that try to do similar in the database


This stuff is not trivial, but it's also not rocket science.

Production is production. Everywhere else, the database is temporary. It is blown away and brought up to a known, source-code-defined state when the application is deployed. You never migrate any database except production. You always rebuild from scratch. Think of the state of the database as being more like the binary than the source code.

The state of a database is defined as the content of a specific base dump from production, plus some number of SQL scripts to run on top of it.

Base dumps need to be put somewhere accessible. When I worked with Oracle, they lived on the server: you run a command to create a dump on the server, then get a sysadmin to scp the dump file to your dev/test database server. With PostgreSQL, you can dump to a local file, so set up a file server or somewhere dumps can go. They could even be in source control, using large-file support. Base dump files are named with a timestamp, GUID, etc.

In the app codebase, there is some config giving the name of the base dump, and those additional SQL scripts. There's some tool, script, build command, etc, which uses those to rebuild the database - delete (or rename!) database, restore base dump, run SQL.
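
A toy version of such a rebuild tool (SQLite in memory, with the dump and scripts inlined as strings where a real setup would read files from the dump server and the repo; all names are made up):

```python
import sqlite3

# Hypothetical config checked into the app repo: which base dump to start
# from, and which SQL scripts to layer on top of it.
CONFIG = {
    "base_dump": "dump_2024_01_15.sql",
    "scripts": ["001_add_email.sql", "002_backfill_email.sql"],
}

# Stand-ins for files on the dump server / in the repo.
FILES = {
    "dump_2024_01_15.sql": (
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
        "INSERT INTO users VALUES (1, 'alice');"
    ),
    "001_add_email.sql": "ALTER TABLE users ADD COLUMN email TEXT;",
    "002_backfill_email.sql": "UPDATE users SET email = name || '@example.com';",
}

def rebuild() -> sqlite3.Connection:
    """Throw the old database away and rebuild: restore dump, then run scripts."""
    conn = sqlite3.connect(":memory:")  # a genuinely fresh database each time
    conn.executescript(FILES[CONFIG["base_dump"]])
    for script in CONFIG["scripts"]:
        conn.executescript(FILES[script])
    return conn

db = rebuild()
row = db.execute("SELECT email FROM users WHERE id = 1").fetchone()
```

Releasing to production is then "run the scripts only", exactly as the comment describes.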

You might not want to include everything in the dump. When I was doing e-commerce work, we would dump the schema and the catalogue data, but not user data (profiles, orders, etc.), because of data protection. If your database is very big, you will obviously have to be very selective about data here.

On one project, the process of rebuilding a database was quite slow, because there was a lot of stuff to add on top of the dump (not just SQL, some of it used some weird data loader tool). To make that faster, we did caching: after a rebuild, we would dump the state of the database, then when doing a rebuild, check if there was already a suitable dump, and if so, just load that. The dump was named as a hash of everything that went into it. Easy to get that wrong! For example, the hash should include the version of the data loader tool.
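
The cache key described above might be computed like this (the tool-version constant and length framing are illustrative; the point is that every input to the rebuild must feed the hash):

```python
import hashlib

TOOL_VERSION = "2.3.1"  # hypothetical data-loader version; it must be part of the key

def cache_key(dump_bytes: bytes, scripts: list) -> str:
    """Hash everything that influences the rebuilt database into one cache key."""
    h = hashlib.sha256()
    for part in [TOOL_VERSION.encode(), dump_bytes, *scripts]:
        # Length-prefix each part so adjacent inputs can't blur together
        # (b"ab" + b"c" must hash differently from b"a" + b"bc").
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()

key = cache_key(b"-- base dump contents", [b"ALTER TABLE t ADD COLUMN x;"])
```

Forgetting any input (as with the loader-tool version in the story above) means a stale cache hit that silently serves the wrong database state.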

When you finally release to production, then as long as the dump is similar to what is live, you just run the SQL scripts; it's like a dev deploy but without deleting the old database and loading a dump (deleting prod and loading a dump of prod peephole optimises to doing nothing!). 'Similar' means that the schema is the same and the data, if any, is reasonably fresh. If the dump is behind production somehow, you have some dev work to do where you update to a fresh dump, and amend the SQL scripts as needed.

Some time after a release to production, take a fresh dump, and propagate it into the source code: update the dump filename in config, and remove any SQL scripts which have now been released. You don't have to do this right away, because the fresh dump should really correspond to the old dump plus the scripts anyway.

If there is additional data or schema you need in some lower environments, that's fine. Include that as some extra SQL scripts which will never be run on production.


I don't understand why you shouldn't migrate non-production databases. I agree that there should not be an expectation that they are stable (although it can still be useful to keep data around in test environments), but if you constantly apply migrations upon every deployment (as well as locally), you have a good test bed for the production migration.

That gives you at least a base level of sanity checking that your migrations work. Of course, this is no guarantee that your production migration will also work - you may, for example, add an FK constraint while there are violating rows in prod - but for that you can wrap your migration in a transaction. Upon failure, the migration is rolled back, and of course you should then also roll back the deployment. Then you can start investigating why the migration went wrong.
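
Here is that scenario in miniature, using a UNIQUE index as the stand-in constraint (SQLite; note that transactional DDL works in Postgres and SQLite, but not in e.g. MySQL, where a failed migration can leave the schema half-changed):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("a@example.com",)])  # duplicates already in "prod"

# Adding a uniqueness constraint fails on the existing data; wrapping it in
# a transaction leaves the schema exactly as it was before the attempt.
try:
    conn.execute("BEGIN")
    conn.execute("CREATE UNIQUE INDEX idx_email ON users (email)")
    conn.execute("COMMIT")
    migrated = True
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    migrated = False

index_names = [row[1] for row in conn.execute("PRAGMA index_list(users)")]
```

After the rollback the deployment can be reverted too, with the database and the code still agreeing with each other.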

Also, creating dumps from prod and storing them locally is something I don't recommend (and it's also disallowed at the company I work for) unless you take care to purge/mask all sensitive data (all PII and everything that includes trade secrets of some sort) before it reaches your local machine.


I wasn't very clear in my statement about migrations, sorry. I'll put it another way: the only migrations you should run in dev are ones you intend to eventually run in production.

The process of loading a base dump and then running SQL scripts is precisely to mimic what will happen when you release to production. As you say, this is a good test bed!

What you shouldn't do is try to move between existing intermediate, non-production-like states of the database by migrating. And really, what I object to are down migrations, because you won't ever use those in production. Migrating up is not so bad. But since running individual up migrations is something you will never do in production, why do it in dev?

Indeed, always running your whole queue of up migrations on top of a fresh base dump is a way of catching the kind of problem you hit in production, where existing data violates a new constraint.

The nuke-and-pave rebuild approach also avoids the problem of moving between branches and reverting pushed but unreleased schema changes. Those become trivial.

As you say, and as I said in my initial comment, there will be data in production that you can't or don't want to use in dev for some reason. But that's just another development task - you need to take your base dumps in such a way that that data is not included. You can exclude certain tables. You can dump only certain rows, corresponding to some kind of testing user, etc. (one bank I worked with had a "shadow branch" defined in their production systems, a made-up branch with made-up customers, that they could use for final testing in the actual production environment). You can dump into an intermediate database in production, anonymise and delete any data you need to, and then take the base dump. I'm sure there are loads of ways to do this.
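
For PostgreSQL, excluding tables from the base dump can be done with pg_dump's exclude flags; here is a sketch that only builds the command line (the table names are made up):

```python
# Tables whose rows must never leave production; their schema still ships.
SENSITIVE_TABLES = ["profiles", "orders", "payment_methods"]

def dump_command(dbname: str) -> list:
    """Build a pg_dump invocation that keeps the full schema but skips
    the contents of sensitive tables."""
    cmd = ["pg_dump", "--format=custom", dbname]
    for table in SENSITIVE_TABLES:
        # --exclude-table-data keeps the table definition but omits its rows
        cmd.append(f"--exclude-table-data={table}")
    return cmd

cmd = dump_command("shop")
```

The resulting dump restores with empty sensitive tables, which lower environments can then seed with fake rows.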


Ah yes, I generally agree with what you said (although I think it's fine to occasionally do destructive and weird things to your local DB and roll them back if you're working iteratively; you can either recreate from scratch if you want to make sure you didn't break anything and otherwise you might still have some pre-production stage where you would notice it if the migrations didn't work).


I think this only really works as long as your dumps fit on one machine.



