Dev x Staging x Prod for Data Science

With data science, it is not always clear whether the work belongs in Dev, Staging, or Prod. If you stopped a random DevOps engineer at a Starbucks, these are probably the strict definitions they would give:

Dev: where you code, using fake data to test.

Staging: you test your code with sampled real data, but if there is a problem, you come back to dev.

Prod: you run your application on real live data and never ever edit code here.

Now, where would you code and test with real live data, a.k.a. do data science? Usually, this is the question that the first DS of a company needs to answer, and I believe it is no coincidence that when it was asked on Stack Exchange it got two completely different answers. Here is what most organizations end up with:

1. Prod “Playground” tool

You use some sort of client installed in Prod that allows you to do Data Science. That can be something that lets you code, such as Databricks, SageMaker notebooks, Vertex notebooks, or some low-code/no-code solution (hey vendors, if you pay me, I can insert you here). You can also consider the little window where you write your queries for Redshift/BigQuery/Postgres as a Prod Playground.

+ + Pros: the tool will abstract security, data access, and project organization for you. Your DevOps team will just see another tool there for them to take care of.

– – Cons: most tools are paid and expensive. In some of them, you will not have access to a code repo solution, either because they don’t support it or, sometimes, because they clearly don’t want to support it (they see DS as a pure business thingy). You might also have difficulties when sending the result of your work to serve a live application: the RandomForest you created in this tool will just be a binary living somewhere that was not created via a Dev->Staging->Prod pipeline. By definition, you are sending to Prod a piece of code that did not go through any code coverage, unit testing, etc. People will get very confused.
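One way to make that orphan binary a little less confusing is to ship it with a small provenance record. The snippet below is only a minimal sketch, not something any of the tools above provide: the function names `build_manifest` and `verify_artifact` and the fields `code_version` and `data_snapshot` are hypothetical names chosen for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(artifact_bytes: bytes, code_version: str, data_snapshot: str) -> dict:
    """Describe a model binary: its hash, the code that built it, the data it saw.

    All field names here are assumptions made for this sketch.
    """
    return {
        "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "code_version": code_version,    # e.g. a commit hash, if the tool exposes one
        "data_snapshot": data_snapshot,  # e.g. a table name plus an extraction date
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def verify_artifact(artifact_bytes: bytes, manifest: dict) -> bool:
    """Return True only if the binary is the exact one the manifest describes."""
    return hashlib.sha256(artifact_bytes).hexdigest() == manifest["sha256"]


# The manifest is plain JSON, so it can sit next to the binary or in a repo.
manifest = build_manifest(b"serialized model bytes", "abc123", "orders@2024-01-01")
manifest_json = json.dumps(manifest, indent=2)
```

This does not replace a real pipeline, but it gives whoever deploys the model a way to check that the binary in front of them is the one that was actually evaluated.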

2. Dev with Fake Data

If I got a coin every time someone asked/suggested/hinted that there is no reason to use real data for DS, and that artificially generated datasets should be enough, I would be Scrooge McDuck by now. This sort of opinion usually comes from a misunderstanding of what DS is: a generated dataset only contains the patterns someone put into it, while DS is about finding the patterns you did not know were there. Just try to explain how it works. Most of the time, they understand.

3. Code in an “ungoverned” Prod

You have access to coding tools and several “degrees of freedom”, usually in a separate cloud account that still has access to Prod data. That might mean the database actually used by the live application, or some sort of Analytical Data-Base/Lake/Mesh.

+ + Pros: it is very fast to set up. You can code everything you need and test any new tool that you want. You are fast and cheap while doing DS.

– – Cons: new software/packages are dangerous. You may end up exposing your company’s data to unsecured software. Moreover, because this account is so different from a traditional Dev or Prod account, your DevOps will most likely not understand or support it, meaning that you are on your own in terms of security and random AWS policy patches. If you are directly accessing the Live Application DB, sooner or later one of your queries will grind the company to a halt.
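If you do end up querying a live database, a couple of cheap guard rails help. The snippet below is a minimal sketch using SQLite from the Python standard library as a stand-in for the real database: it opens the connection read-only, aborts queries that run past a time budget, and caps the number of rows returned. The helper names `open_read_only` and `run_guarded` and their defaults are hypothetical, and a server database such as Postgres or Redshift would need its own equivalents (a read-only role, a statement timeout).

```python
import sqlite3
import time


def open_read_only(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only; any write raises OperationalError."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def run_guarded(conn, sql, params=(), max_rows=1000, timeout_s=2.0):
    """Run a query with a time budget and a cap on the rows fetched."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 VM instructions;
    # a non-zero return value aborts the running query.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql, params).fetchmany(max_rows)
    finally:
        # Remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 1000)
```

A runaway query then fails with `sqlite3.OperationalError` in the notebook instead of holding the database hostage.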

4. Build a beautiful Prod Sandbox

This is Option 3, but on steroids. You spend a lot of time shielding and automating everything that could go wrong with an ungoverned Prod, so that you end up with something that is both flexible and secure. This Sandbox has access to both the Analytical Data-Base/Lake/Mesh and the actual “Live” Prod (where your models end up living).

+ + Pros: you are safe, fast, and flexible.

– – Cons: it is very hard to build this. You are going to need mirrors of PyPI, limited capacity to change account configurations, etc. There is a lot of code to write to make a safe connection between the Data-Bases and Live Prod. It is going to take some time to explain every little configuration change you need.
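To give a feeling for the kind of plumbing involved, here is a minimal sketch of one small piece: checking a requirements list against the set of packages a curated mirror would carry. The `ALLOWED` set and the function names are hypothetical, and a real mirror enforces this on the server side rather than in a script. The name normalization follows the PEP 503 rule.

```python
import re

# Hypothetical allowlist; in practice this would come from the mirror's index.
ALLOWED = {"numpy", "pandas", "scikit-learn"}


def normalize(name: str) -> str:
    """Normalize a package name as PEP 503 does (runs of -, _, . become one -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def disallowed(requirements, allowed=ALLOWED):
    """Return the entries of a requirements list that are not on the allowlist."""
    allowed_norm = {normalize(a) for a in allowed}
    blocked = []
    for line in requirements:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # The package name is the leading run of name characters,
        # before any version specifier or extras.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        # Lines that do not start with a name (e.g. pip options) are blocked as-is.
        name = normalize(match.group(0)) if match else line
        if name not in allowed_norm:
            blocked.append(name)
    return blocked
```

Running it over `["pandas==2.1.0", "Scikit_Learn>=1.3", "leftpad"]` flags only `leftpad`, which is the conversation you want to have before the install, not after.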

Which one to pick?

Notice that the three “serious” options all use Prod. It turns out that adding flexibility to an environment with real data is much easier than adding real data to a Dev environment. Now, choosing between Options 1, 3, and 4 will depend on your particular corporate environment.

If you have some money lying around and would like to avoid all those painful infrastructure conversations, go for Option 1, the Prod Playground. But try to go for tools that offer some Software Engineering capabilities, such as code versioning. Otherwise, it will be very hard to get out of it.

When money is tight, I recommend Option 3, the Ungoverned Prod, while being transparent about the risks with all parties involved, which is a fancy way of saying “things can go wrong here, let’s all hold hands and go gently into this dark night.” If you are starting a DS practice and aim right away for Option 4, the Prod Sandbox, you will spend precious time setting it up instead of doing something of value for your business, and that might not look good for you. Instead, treat every project as a way of collecting “credits” that you can invest in transforming the Ungoverned Prod into a Sandbox. And spend your credits slowly over time.

vlzda's views