This article is a high-level overview of tools that, from my experience, play a crucial role in building quality software.
Missing one or even several of these points usually doesn't cause fatal damage. But humans are very creative and, for no good reason (probably to re-validate proven theories), manage to combine circumstances into something even worse than the absence of several validation layers. Also, many people, while digging into a specific area, tend to forget about "broad" knowledge.
I believe testing (together with monitoring) is one of the most fundamental and valuable areas in modern software engineering. While programming and data analysis are more and more automated and require less human attention, testing is, after all, just a way of declaring what output we want. If we are unable to describe what we want (or to audit what was done), then our importance in this cycle rapidly decays.
I mostly use Golang and Python at work, so almost all examples relate to these programming languages.
The title is inspired by this beautiful video by “Rational Animation”.
Static analysis is definitely a huge area, one I'm not very familiar with, but I cannot skip mentioning golangci-lint. It's mostly an aggregator that helps organize other tools, and it does this job pretty well. However, it can sometimes cause a few inconveniences: for example, gosec can produce reports for SonarQube, but the results wrapped by golangci-lint look worse than a direct integration. Anyway, this kind of problem is solvable by using different configurations for CI/CD and the IDE.
SonarQube is an example of another subtype, continuous inspection, which means it also reports how code quality changes over time. In addition, it has a very nice feature that checks cognitive code complexity, and it works out of the box relatively well.
We have a wide range of tools for validating data both in "instant" communication protocols (meaning RPCs) and in "delayed" communication (meaning queues). OpenAPI specs are often written in YAML. JSONSchema also still plays a viable role, typically in various kinds of REST APIs. If you are ready for binary protocols, choose wisely between Avro, Protobuf, and Thrift. Speaking specifically of Golang solutions, go-playground/validator is a good and popular option.
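To make this concrete, here is a minimal sketch of struct-tag validation with go-playground/validator; the request type and the rules are invented for illustration, not taken from a real service:

```go
package main

import (
	"fmt"

	"github.com/go-playground/validator/v10"
)

// CreateUserRequest is a hypothetical request body; the tags declare the rules.
type CreateUserRequest struct {
	Email string `json:"email" validate:"required,email"`
	Age   int    `json:"age"   validate:"gte=18,lte=130"`
}

func main() {
	validate := validator.New()

	req := CreateUserRequest{Email: "not-an-email", Age: 15}
	if err := validate.Struct(req); err != nil {
		// ValidationErrors reports which field violated which rule.
		for _, fieldErr := range err.(validator.ValidationErrors) {
			fmt.Println(fieldErr.Field(), "failed rule:", fieldErr.Tag())
		}
	}
}
```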
Also, statically typed structures are validators by themselves. Interestingly, similar tooling shows up in communication with databases and queues, where the options above are also available. DBs typically combine "schema on write" and "schema on read", which is another, abstractly described, validation layer. I don't know about queues in general, but Kafka has a Schema Registry, which not only helps validate data but also significantly simplifies the evolution of these schemas.
This kind of technique is likely a subtype or descendant of “design by contract”. If, for any reason, you want this style of validation, then you could easily find libs on the mentioned wiki page.
AI code review
I don't know of any freely available tool of this kind for Golang, though one probably exists. An example of commercially available tooling is deepsource.io. These guys claim that their product can replace a huge bunch of free and commercial tools.
Of course, there are examples of publicly available work for other languages: this one about JavaScript [1] and a more recent one about Python [2]. Both of them were mentioned in Snyk's DeepCode engine presentation. Advertisement aside, it is clear from the video that to apply (or buy) any tooling for this purpose, an organization needs at least one actual specialist in the area if it doesn't want to be fooled. Another interesting observation is that this tooling is designed to complain about the absence of the layers mentioned in the previous paragraph about data validation.
The overhyped GitHub Copilot could also be put on the list, along with its open-source alternative, FauxPilot. But my goal is not a complete review of such tooling. Many articles attempting to classify these tools have been published recently; you can form your own opinion by reading several of them.
Schemes or graphs
Surprisingly (I'm joking), the quality of design documentation directly affects code quality. From my personal experience, diagrams work better than text descriptions. Meanwhile, a lot of software developers try to ignore the tooling. UML, BPMN, ERD, and C4 seem to be the four most popular notations. Besides that, a simple mind map also works pretty well, although it serves a slightly different purpose: explaining ideas to the author himself. Interestingly, it can also work in the opposite direction, for example by generating class diagrams from existing code, which can be used in an actual-vs-desired state workflow. Visualizing definitely helps create better programs, even if the actual program barely correlates with the design diagram at the end.
Design principles like SOLID and KISS look too trivial to mention here, especially since the latter has definitely failed on a global scale. But unfortunately, the share of effective applications in practice, at least of the first one, is so small that it looks like only a small percentage of engineers know anything beyond the definition. I highly recommend the course for anyone who writes in Go: it draws a good connection between basic theory and the actual functionality of the language. Sadly, the course hasn't been updated to include recent changes in the syntax.
I was taught that modifying code specifically for testing purposes is a bad idea. But it's a very subtle line: actually, a lot of OOP-style code is written exactly for this purpose. Interfaces are, most of the time, used to simplify unit testing, not out of thinking about possible future variations. Some will argue that this is a flaw of the OOP style, but I have one more example that isn't connected to OOP: FSMs. For me, state machines are a way of embedding activity diagrams inside the code for continued validation of the idea of how the system must work. One more observation: FSMs usually appear in software engineering and Markov chains in machine learning, but they have a very strong connection.
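As a rough illustration of what I mean by embedding the diagram into the code, here is a minimal sketch of an FSM for a hypothetical order workflow (the states and transitions are invented for the example):

```go
package order

import "fmt"

// State of a hypothetical order; each value mirrors a box in the state diagram.
type State string

const (
	Created   State = "created"
	Paid      State = "paid"
	Shipped   State = "shipped"
	Cancelled State = "cancelled"
)

// transitions encodes the arrows of the diagram; anything not listed is invalid,
// and terminal states (Shipped, Cancelled) simply have no outgoing arrows.
var transitions = map[State][]State{
	Created: {Paid, Cancelled},
	Paid:    {Shipped, Cancelled},
}

// Transition validates a state change against the embedded diagram.
func Transition(from, to State) (State, error) {
	for _, allowed := range transitions[from] {
		if allowed == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("invalid transition %s -> %s", from, to)
}
```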
Since we have arrived at FSMs, I have a couple more points to add here. First, we are likely approaching data validation again: the state is also represented as data, and state transitions are the same, only with sequence markers. Second, UML has a specific diagram subtype, the UML state machine. (It's often forgotten even on pages dedicated to FSMs, which is why I try to advertise it.)
Of course, it’s only the tip of the iceberg, and by digging deeper, you could find many more inspiring concepts.
Formal methods
When speaking about validation without embedding it into the code, I also want to mention languages which allow validating the algorithms themselves, i.e., formal specification languages. There are three main categories here, but I will mention only two. The first is model checking: a kind of brute-force check of all possible states (yes, it's different from fuzz testing, because it requires modeling the system in a specialized language). TLA+ seems to be a good example of this class; it is a tool for validating program architecture with the previously mentioned state machine concept. The second class is automated theorem provers, and I think the most popular example is the Z3 theorem prover. A very interesting aspect of Z3 is that it can be applied to finding solutions, not only to validating them. And what is even more impressive is the connection between Z3 and TLA+: apalache. You may ask how all this scientific stuff relates to Golang programming. Look closer at the project named GCatch [3]. Unfortunately, the project currently barely evolves; after looking into the source code, I concluded that the project suffers from poor code quality itself.
Of course, the area is much wider than these two languages. For example, there is one specialized in designing distributed systems: P. Actually, TLA+ was also designed with distributed systems at its core; look, for example, at the original work on Paxos.
Despite how hard it currently is to implement and support, the area looks like one of the most promising in software engineering overall. Due to the price of adoption, adopters try to cut costs by covering only the most critical (or easiest to model) parts of the software. However, nowadays it's hard to impress anybody with complex code or AI, while code without bugs is something "new" and really in demand. I expect the technology will soon be mentioned not only in the context of rocket science, medical devices, or aircraft manufacturing.
Just beginning
I think the testify package is almost the standard way of writing tests in Golang. However, I often see weird constructions used instead of simply using testify, even in popular open-source projects. More "advanced" users like "table" tests; ok, I might agree that this can be useful for verifying a block with minimal parameters and not many possible outcomes. But what I often see are test rows containing black-magic flags which are passed into an infinitely long t.Run(…) block. And surprise: most IDEs support this type of construction badly, and when you want to debug a specific case, you simply can't do it without tricks. The situation becomes even worse when the "table" contains tens or hundreds of rows. Maybe I just don't know how to deal with such problems; I would be grateful if someone could give me a solution.
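Roughly the kind of construction I mean, in a simplified sketch (the ApplyDiscount function and the flag are invented; real tables are usually far longer):

```go
package discount

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// ApplyDiscount is a hypothetical function under test: price in cents, 10% off for VIPs.
func ApplyDiscount(totalCents int, vip bool) int {
	if vip {
		return totalCents * 90 / 100
	}
	return totalCents
}

func TestApplyDiscount(t *testing.T) {
	cases := []struct {
		name  string
		total int
		vip   bool // the kind of flag that quietly switches behavior inside t.Run
		want  int
	}{
		{name: "regular customer", total: 10000, vip: false, want: 10000},
		{name: "vip customer", total: 10000, vip: true, want: 9000},
		// ...in real projects this table often grows to tens or hundreds of rows
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			assert.Equal(t, tc.want, ApplyDiscount(tc.total, tc.vip))
		})
	}
}
```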
But even if the problem above is solvable, in my opinion the design itself is flawed. Instead of making these awkward-looking tables, I prefer using Suite from the testify package. If you want repeated logic that doesn't fit the SetupTest/TearDownTest design, it's always possible to define a private method on the type that embeds the Suite. And this construction provides a logically natural place for the description, namely the test method name, whereas "table" tests require a "name" column, which to me is a slightly strange solution.
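The same hypothetical check rewritten as a testify Suite, reusing ApplyDiscount from the sketch above; every case becomes a separately named (and separately debuggable) test method:

```go
package discount

import (
	"testing"

	"github.com/stretchr/testify/suite"
)

// DiscountSuite groups related checks; shared state is prepared in SetupTest.
type DiscountSuite struct {
	suite.Suite
}

func (s *DiscountSuite) SetupTest() {
	// reset or prepare shared state before every test method
}

// checkDiscount holds logic shared between test methods, instead of a "table" row.
func (s *DiscountSuite) checkDiscount(totalCents int, vip bool, want int) {
	s.Equal(want, ApplyDiscount(totalCents, vip))
}

func (s *DiscountSuite) TestRegularCustomerPaysFullPrice() {
	s.checkDiscount(10000, false, 10000)
}

func (s *DiscountSuite) TestVipCustomerGetsTenPercentOff() {
	s.checkDiscount(10000, true, 9000)
}

func TestDiscountSuite(t *testing.T) {
	suite.Run(t, new(DiscountSuite))
}
```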
Mocks, BDD and more
testify also contains a mock package, but, mostly for historical reasons, gomock is more widespread. And there are more solutions on the market for mocking in Golang.
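A minimal hand-written sketch using testify's mock package (the Notifier interface and ChargeAndNotify are invented for the example; gomock plays a similar role but generates the stub from the interface for you):

```go
package billing

import (
	"testing"

	"github.com/stretchr/testify/mock"
	"github.com/stretchr/testify/require"
)

// Notifier is a hypothetical dependency we want to replace in tests.
type Notifier interface {
	Notify(userID string, message string) error
}

// MockNotifier is a hand-written testify mock for the interface above.
type MockNotifier struct {
	mock.Mock
}

func (m *MockNotifier) Notify(userID, message string) error {
	args := m.Called(userID, message)
	return args.Error(0)
}

// ChargeAndNotify is a hypothetical function under test.
func ChargeAndNotify(n Notifier, userID string) error {
	// ...charging logic would live here...
	return n.Notify(userID, "you have been charged")
}

func TestChargeAndNotify(t *testing.T) {
	notifier := new(MockNotifier)
	notifier.On("Notify", "user-42", mock.Anything).Return(nil)

	require.NoError(t, ChargeAndNotify(notifier, "user-42"))
	notifier.AssertExpectations(t)
}
```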
By the way, all this stuff somehow covers only the basic needs in testing, for when, for whatever reason, you don't want to increase the complexity of a project. If you are more or less serious about testing, it's mandatory to know about BDD-style frameworks: godog, or, less bound to the original methodology but with a convenient fluent interface, ginkgo.
I'm not sure about the right open-source library for the fixtures concept applied to Golang (there is a lot of info on the internet, but it doesn't fully match my understanding of fixtures); the last time I needed something in this style, I wrote my own fixture loader strictly bound to the specific task. My understanding of fixtures is inspired by pytest.
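To show what I mean, here is a minimal sketch of such a task-specific loader, assuming one JSON file per case under testdata/ and a hypothetical InvoiceFixture shape; it combines nicely with the Suite approach above via SetupTest:

```go
package billing

import (
	"encoding/json"
	"os"
	"path/filepath"
	"testing"
)

// InvoiceFixture is a hypothetical shape of the test data files.
type InvoiceFixture struct {
	UserID string  `json:"user_id"`
	Total  float64 `json:"total"`
	VIP    bool    `json:"vip"`
}

// loadFixture reads testdata/<name>.json and fails the test if anything is wrong.
func loadFixture(t *testing.T, name string) InvoiceFixture {
	t.Helper()

	raw, err := os.ReadFile(filepath.Join("testdata", name+".json"))
	if err != nil {
		t.Fatalf("reading fixture %s: %v", name, err)
	}

	var f InvoiceFixture
	if err := json.Unmarshal(raw, &f); err != nil {
		t.Fatalf("decoding fixture %s: %v", name, err)
	}
	return f
}
```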
Another interesting story is the Allure project. It has many Golang libraries: just google and choose whatever you like, or maybe you want to write another one?
Other methodologies
Honestly, testing isn't my main area of interest. I can only recommend the Wikipedia page and suggest using as many of the methodologies listed there as possible.
Testing of database schemas and similar stuff is also slowly growing; here is just one example.
One place where I do want to make a specific point is performance testing. I think it's more or less clear how to write benchmarks using the built-in Golang functionality or how to profile with pprof. What's more difficult is defining the criteria for "premature optimization", and here a concept named causal profiling [4] can help. It has Golang support as well.
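For completeness, the built-in benchmark shape looks like this, reusing the hypothetical ApplyDiscount from the earlier sketches:

```go
package discount

import "testing"

// BenchmarkApplyDiscount measures the function in a tight loop driven by b.N.
func BenchmarkApplyDiscount(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ApplyDiscount(10000, true)
	}
}
```

Run it with `go test -bench=. -cpuprofile=cpu.out` and inspect the profile with `go tool pprof cpu.out`.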
I’ll return to the problem of “causality” closer to the end.
On top of this, I also want to mention another project from the same author, this time for Python: Scalene [5]. However, you likely already know about it.
Training data validation
Today it is almost impossible to imagine a competitive service that doesn't use ML. There are a lot of things that might be done, but I will speak about one popular setup: an ML-driven system that at its core contains a supervised learning model which must be retrained on a regular basis.
We can use one of two techniques, or a combination of them: filtering the data itself, or having logic for rejecting the results of training on "bad" data (I mean using the model itself to characterize the input flow). The second might be a bad idea in itself, but it can easily be handled by rolling back to the previous version; the trained model should at least have a backup anyway (or, better, version control).
Because I specialize in time series (shortened to TS below), almost all my examples relate to this area. There is an almost infinite range of tools for validating the input TS flow; I mean features that can be extracted from the input data and subsequently validated. Let's begin with the features themselves.
Feature extraction
A set of interesting methods is provided by the pyts library, but a significant part of them assumes specific input data; I mean prior knowledge about what the TS means.
A more general and classical set of methods is presented in the tsfresh library. Another set of methods, about correlation between time series, can be found in tslearn. And there are more of them, for example the set of statistical methods implemented in the cesium-ml project. Actually, there are so many methods for generating new features from existing data that research about their importance [6] exists.
Declining wrong models
The process of throwing away not-very-successful models has become a separate discipline in ML.
First, we should decide what data we want to compare a model's results against when deciding on its failure or success. In supervised ML the most typical approach is to hide part of the data at the training stage and test the model against the hidden part. To do this effectively, people invent tricks, but not all of them work well for time series.
Then we choose the function with which we evaluate the difference from the actual results; look at several popular model KPIs. The discussion below is about "regression metrics", which are interesting to me because of their direct connection to time series; the second group, "classification metrics", might also be interesting in this context, but only through the support of TS-specific tasks by classification ML methods. A high-level overview of the concepts can be found in this blog post.
Another source of the madness is trying to figure out on the fly in which direction the models are evolving and cutting off branches without prospects. A good-quality set of such methods is presented in the mle-hyperopt package, from the simplest grid search to relatively new approaches such as Hyperband (a novel bandit-based approach). Across the community, Optuna seems to have gained more popularity. Meanwhile, the methods in these libraries do not completely overlap, so there is no silver bullet, and they have to be chosen wisely. At this point we have stepped far away from the original topic, because we have touched the area of AutoML, but several more points must be mentioned.
The techniques above can be combined with model ensembling and per-model hyperparameter tuning; an example that organizes the process with evolutionary algorithms is FEDOT [7].
Another approach to searching for optimal hyperparameters is to train a neural network (or any other model) to choose a specialized optimal model with optimal hyperparameters (or a set of them for further processing with one of the methods above, or for ensembling) without additional learning (a.k.a. zero-shot or few-shot inference); this methodology usually appears under the name "meta-learning". A well-known example in the TS world is Kats [8].
Let's return to feature extraction: why might it even be important if we have cross-validation, a wonderful set of KPIs, and state-of-the-art accelerators for all this stuff? Because we can receive TS data that is able to cause damage through all these types of defense.
Data normalization
Besides normalizing the data itself, I want to make several notes about normalizing the timestamps of the data. Various algorithms can be more or less tolerant to timestamp skewness (for clarity: the skewness function applied to the time series timestamps). Based on the distribution of missing data, we can make different decisions about filling in missing points or taking a block without missing points (or with only a few of them). It is also good to know the overall percentage of missing data (this requires a fixed scraping interval for data points, or some other clue for estimating the expected number of points).
If, after the preliminary analysis above, we conclude that we have to deal with completely irregular time series, then it is likely a good idea to use specialized methods: Croston's method [9], Lomb-Scargle periodograms [10], etc.
Another simple but often forgotten step is to check timestamp boundaries: no data points from the future (usually caused by program errors and attempts to consume results already processed by another model) and no super-ancient data (a typical source is problems with date conversion and corrupted data in the DB).
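This kind of boundary check is exactly the sort of thing I would put into the Golang "first line of defense" mentioned at the end of the article. A minimal sketch, assuming points arrive sorted by timestamp and with a fixed expected interval; the thresholds are arbitrary illustrations, not recommendations:

```go
package ingest

import (
	"fmt"
	"time"
)

// Point is a single datapoint of an incoming time series.
type Point struct {
	Timestamp time.Time
	Value     float64
}

// maxAge and maxFutureSkew are illustrative thresholds only.
const (
	maxAge        = 5 * 365 * 24 * time.Hour
	maxFutureSkew = time.Minute
)

// ValidateTimestamps rejects points from the future or from the ancient past and
// reports the share of missing points, given a fixed expected interval.
// It assumes points are sorted by timestamp in ascending order.
func ValidateTimestamps(points []Point, expectedInterval time.Duration) (missingRatio float64, err error) {
	if len(points) < 2 {
		return 0, fmt.Errorf("need at least two points, got %d", len(points))
	}

	now := time.Now()
	for _, p := range points {
		if p.Timestamp.After(now.Add(maxFutureSkew)) {
			return 0, fmt.Errorf("point from the future: %s", p.Timestamp)
		}
		if now.Sub(p.Timestamp) > maxAge {
			return 0, fmt.Errorf("point is too old: %s", p.Timestamp)
		}
	}

	// Expected number of points for the observed span at the fixed interval.
	span := points[len(points)-1].Timestamp.Sub(points[0].Timestamp)
	expected := int(span/expectedInterval) + 1
	if expected <= len(points) {
		return 0, nil
	}
	return float64(expected-len(points)) / float64(expected), nil
}
```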
In addition to the regularity validation above, it is good to perform a more substantial frequency analysis. If autocorrelation functions and Fourier analysis didn't help, wavelet transforms might be an option. Don't forget to check stationarity if you are old-fashioned and your model (or system of models) is sensitive to it.
Also, quite often it is necessary to normalize metadata for the time series (univariate or multivariate extraction, it doesn't matter at this point). A lot of techniques can be used here, but I want to highlight one specific tool: PClean [11]. Intriguingly, at the 2019 conference it had an integration with a GPT model, but in the repo I can't find any mention of it. It's unclear whether I'm bad at searching, or the link was deprecated as ineffective, or, quite the opposite, it is too effective to remain freely available.
And of course, one of the best all-in-one solutions for the previously described steps, plus more, is TFX. It's worth a look even if you don't plan to use it. But, of course, this isn't all; several more specialized tools have been developed. Deepchecks is likely one of TFX's competitors in the field.
TheDeepChecker is definitely a promising project whose goal is to identify various problems in programming deep neural networks, including problems with the input data. It is quite similar to the tools from the "AI code review" topic above, but this time targeted at ML code.
Short remark about model explanation
Beyond the raw power of supervised learning, it is important to understand how a model interprets the data it consumes. A lot of articles have been written about SHAP; it's likely an industry standard.
This stuff is required not out of curiosity, nor because you don't trust machines, but due to the complexity of the other stages. It can reveal flaws in data cleaning and normalization, help overcome overfitting, and so on. Or it can show that our input dataset is just too small, or that we were simply unlucky enough to catch a bunch of outliers in our sample data.
Plan B
If our tricks with input validation, model verification, and verification models for model verification fail, we are obliged to have a plan B ready before it happens. Because it is inevitable: this technology has too many parts, which, by reliability theory, means we have a lot of points of failure. Even a carefully prepared maintenance page is better than a completely broken system demonstrated to the user.
I think sooner or later an organization must define strict rules for operational acceptance testing. Kubernetes and properly configured modern databases satisfy most of the typical requirements of this kind; however, the regular attempts to reinvent the wheel for the sake of performance should be regulated by that set of rules.
A surface overview of post-deployment testing could be found here.
Even if we somehow failed to prepare a quality product with the previous steps, we still have a chance to prevent damage to our reputation (or at least reduce it) by using a clever roll-out to production. There is a bunch of classic methods for switching to a new version: blue/green, canary, etc. At this point it becomes clear that without a modern routing ecosystem it will be hard or impossible to use these tricks.
Also, if you have persistent storage with schema migration requirements, it's important to think not only about backward compatibility but also about a tested migration rollback. The migration script (as well as the rollback script) must satisfy production performance requirements, to avoid the rare case where it blocks everything else.
Of course, it's difficult to maintain all these procedures without a proper CI/CD framework. Flamingo looks pretty intriguing; however, I think any tool from the area, if used properly, tremendously simplifies life. The GitOps approach, despite the criticism, plays a crucial role in organizing a sequence of state transitions (yes, they have spread everywhere). When we have to deal with regular rollouts, it is a good idea to check the error budget.
There is also one more method in use when we inevitably can't test something without rolling it out to a small portion of customers. A/B testing is a separate branch of science, but a few words should be said. Today the results are almost always analyzed with ML. "Treatments", which have an effect on covariate groups, also exist as sequences of state transitions (sounds familiar?). We can make predictions by optimizing local kinds of KPIs (CATE, ITE).
Actually, this stuff (like much else) is intended for modeling a better alternative reality (one where we get more money or more patients survive). But it provides this information not in the form of a fancy, fully rendered digital reality: to take advantage of the technique, it is enough to understand numbers, not pictures. The trick significantly improves what our brains have been doing for millions of years: making predictions about the next action based on past information.
This whole collection of thoughts may look fragmented, but I have a couple of reasons why all of it is united into one article. First: Golang fits well as the "first line" of defense in front of a system whose core contains ML or other complex code written in Python. Writing that line in Python is also possible, but from my experience Golang significantly simplifies the task, and it works well for the simple and straightforward job of schema validation. Second: from time to time, useful interconnections appear. DTW (or a method from this family, e.g. CTW [12]) looks pretty useful for testing tasks related to time series (where time shifts are possible).
I haven't had the luck to try it in practice, but I have several ideas in mind: converting queries between time series databases (comparing results of different query engines with minor shifts due to algorithm fixes), searching for similar items with unsynced timestamps, anything related to human-generated data, etc. Gap-filling from FEDOT [7] might help hide issues in other layers of defense in extreme cases. Granger causality can be used for more than preparing covariates for prediction models. And you probably want to know about the combination of DTW and Granger causality: variable-lag Granger causality [13]. If the idea "catches" you, a wide range of methods is available for this stuff [14].
[1] M. Pradel and K. Sen, “DeepBugs: a learning approach to name-based bug detection,” Proc. ACM Program. Lang., vol. 2, no. OOPSLA, pp. 1–25, Oct. 2018, doi: 10.1145/3276517.
[2] M. Allamanis, H. Jackson-Flux, and M. Brockschmidt, “Self-Supervised Bug Detection and Repair.” arXiv, Nov. 16, 2021. doi: 10.48550/arXiv.2105.12787.
[3] T. Tu, X. Liu, L. Song, and Y. Zhang, “Understanding Real-World Concurrency Bugs in Go,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, in ASPLOS ’19. New York, NY, USA: Association for Computing Machinery, Apr. 2019, pp. 865–878. doi: 10.1145/3297858.3304069.
[4] C. Curtsinger and E. D. Berger, “Coz: finding code that counts with causal profiling,” in Proceedings of the 25th Symposium on Operating Systems Principles, Monterey, California: ACM, Oct. 2015, pp. 184–197. doi: 10.1145/2815400.2815409.
[5] E. D. Berger, S. Stern, and J. A. Pizzorno, “Triangulating Python Performance Issues with Scalene.” arXiv, Dec. 14, 2022. Accessed: May 27, 2023. [Online]. Available: http://arxiv.org/abs/2212.07597
[6] C. H. Lubba, S. S. Sethi, P. Knaute, S. R. Schultz, B. D. Fulcher, and N. S. Jones, “catch22: CAnonical Time-series CHaracteristics,” Data Min. Knowl. Discov., vol. 33, no. 6, pp. 1821–1852, Nov. 2019, doi: 10.1007/s10618-019-00647-x.
[7] N. O. Nikitin et al., “Automated Evolutionary Approach for the Design of Composite Machine Learning Pipelines,” Future Gener. Comput. Syst., vol. 127, pp. 109–125, Feb. 2022, doi: 10.1016/j.future.2021.08.022.
[8] P. Zhang et al., “Self-supervised learning for fast and scalable time series hyper-parameter tuning,” arXiv preprint arXiv:2102.05740, 2021.
[9] A. Segerstedt and E. Levén, A study of different Croston-like forecasting methods. 2020. Accessed: Apr. 24, 2023. [Online]. Available: https://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-78088
[10] J. T. VanderPlas, “Understanding the Lomb-Scargle Periodogram,” Astrophys. J. Suppl. Ser., vol. 236, no. 1, p. 16, May 2018, doi: 10.3847/1538-4365/aab766.
[11] A. K. Lew, M. Agrawal, D. Sontag, and V. K. Mansinghka, “PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming.” arXiv, Nov. 18, 2022. doi: 10.48550/arXiv.2007.11838.
[12] F. Zhou and F. De la Torre, “Canonical Time Warping for Alignment of Human Behavior,” in Advances in Neural Information Processing Systems, 2009.
[13] C. Amornbunchornvej, E. Zheleva, and T. Berger-Wolf, “Variable-lag Granger Causality and Transfer Entropy for Time Series Analysis,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 4, p. 67:1–67:30, May 2021, doi: 10.1145/3441452.
[14] R. Moraffah et al., “Causal Inference for Time series Analysis: Problems, Methods and Evaluation.” arXiv, Feb. 10, 2021. doi: 10.48550/arXiv.2102.05829.