Coding Standard conformance in Open-Source Data Science projects
Data scientists play a vital role in translating mathematical and statistical techniques for AI and machine learning out of papers and into code for solving real-world problems. However, as machine learning code is just one part of a larger system, data scientists need to work together with software engineers to make sure the system is production ready. Coding standards facilitate this collaboration and (usually) improve readability, which, in theory, supports greater maintainability and a lower likelihood of bugs getting into production code. However, data scientists often come from diverse backgrounds, such as mathematics and physics, which means that they may not be as familiar with coding standards as software engineers, or may not value conformance to coding standards as highly.
In our paper, A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects (preprint here), presented at this year’s Empirical Software Engineering and Measurement conference (recording here), we investigated open-source data science projects on GitHub to see whether or not they followed good coding standards.
To do this, we collected a corpus of 1048 Open-Source Data Science projects from GitHub, and ran them through Pylint (a code quality analyser for Python, the most common language for data science) to detect coding standard violations and code smells. We then compared the results to a reference group of 1099 traditional Python projects (with a similar number of stars and age to the data science projects to make sure it was a fair apples-to-apples comparison).
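As a rough sketch of the measurement step (not the exact pipeline used in the paper), Pylint can emit its messages as JSON, which makes it straightforward to tally warnings by type and normalise them by project size. The helper names below are hypothetical and the sample messages are made up for illustration.

```python
import json
from collections import Counter


def count_warnings(pylint_json: str) -> Counter:
    """Tally Pylint messages by symbol (e.g. 'too-many-locals')."""
    messages = json.loads(pylint_json)
    return Counter(message["symbol"] for message in messages)


def warnings_per_line(counts: Counter, lines_of_code: int) -> dict:
    """Normalise counts by project size so large and small projects are comparable."""
    return {symbol: n / lines_of_code for symbol, n in counts.items()}


# Illustrative input in the shape produced by `pylint --output-format=json <path>`.
sample_output = json.dumps([
    {"symbol": "too-many-locals", "path": "model.py", "line": 10},
    {"symbol": "invalid-name", "path": "model.py", "line": 12},
    {"symbol": "invalid-name", "path": "train.py", "line": 3},
])

counts = count_warnings(sample_output)
rates = warnings_per_line(counts, lines_of_code=200)
```

Dividing by lines of code is what allows a per-warning comparison between two groups of projects with very different sizes.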
Comparing the number of Pylint warnings of each type between the Data Science projects and traditional projects, the most significant difference was in Pylint’s too-many-locals warning (triggered, by default, whenever a function or method contains more than 15 local variables). We found that Data Science projects contain over twice as many cases (per line of code) of functions with excessive numbers of local variables compared to traditional software projects.
We initially tried visualising the distributions as box plots, but they were difficult to read because of the highly skewed distributions involved and the dense cluster of traditional projects with zero violations (which caused the bottom half of the box plot to almost completely disappear). To deal with this, we created survival plots showing what proportion of projects (on the y-axis) have at least the number of Pylint warnings per line of code shown on the x-axis. The distribution for Data Science (DS) projects is shown with solid blue lines and, for comparison, the distribution for traditional (non-DS) projects is shown with dotted orange lines (top-right is worse).
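The survival curve described above can be computed in a few lines. This is a minimal sketch rather than the plotting code behind the figures; the function name and the sample rates are invented for illustration.

```python
def survival_curve(values):
    """Return (x, proportion) pairs: the share of projects with a value >= x."""
    ordered = sorted(values)
    total = len(ordered)
    curve = []
    for index, x in enumerate(ordered):
        # Skip repeated values so each x appears once, at its first position.
        if index > 0 and x == ordered[index - 1]:
            continue
        # Everything from this position onwards is >= x.
        curve.append((x, (total - index) / total))
    return curve


# Made-up warnings-per-line rates for five hypothetical projects.
rates = [0.0, 0.0, 0.01, 0.02, 0.02]
curve = survival_curve(rates)
```

Unlike a box plot, the curve stays readable when many projects sit at zero: they simply share the leftmost point, where the proportion is 1.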
We also found that Data Science projects triggered more Pylint invalid-name warnings. These can occur due to short variable names (e.g. use of single letter variable names rather than something more meaningful) or violating case conventions (e.g. use of uppercase for variables that are not constant).
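To make the two kinds of naming violation concrete, here is an illustrative check. The pattern is a simplified assumption (lowercase snake_case, at least three characters) and is not Pylint's actual rule set, whose defaults depend on the Pylint version and on the kind of name being checked.

```python
import re

# Simplified assumption: lowercase snake_case with at least three characters.
SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]{2,}")


def looks_invalid(variable_name: str) -> bool:
    """Return True if a variable name breaks the simplified convention above."""
    return SNAKE_CASE.fullmatch(variable_name) is None


names = ["learning_rate", "x", "K", "Weights"]
flagged = [name for name in names if looks_invalid(name)]
```

Here `x` and `K` fail on length, while `K` and `Weights` also fail on case.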
Next we looked within data science projects to see which modules were responsible for the warnings. The too-many-locals warnings were mostly confined to the core data science modules within a project (i.e. those making use of data science libraries such as TensorFlow and NumPy). In contrast, the invalid-name warnings were spread evenly throughout the entire data science project, including unit tests and other surrounding code.
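One plausible way to separate core data science modules from surrounding code is to inspect each module's imports. The sketch below uses Python's ast module; the library set contains only the two libraries named above, and the function names are hypothetical, so treat it as an approximation of the idea rather than the classification used in the study.

```python
import ast

# Assumption: only the two libraries named in the text; a real list would be longer.
DATA_SCIENCE_LIBRARIES = {"numpy", "tensorflow"}


def imported_packages(source: str) -> set:
    """Collect the top-level package names imported by a module's source code."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 excludes relative imports such as `from . import model`.
            packages.add(node.module.split(".")[0])
    return packages


def is_data_science_module(source: str) -> bool:
    """Return True if the module imports at least one data science library."""
    return bool(imported_packages(source) & DATA_SCIENCE_LIBRARIES)


model_source = "import numpy as np\nfrom tensorflow import keras\n"
test_source = "import unittest\nfrom . import model\n"
```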
Although our study found strong evidence that data science code deviates from traditional software coding standards, this does not necessarily mean you should go on a crusade to refactor all data science code just to satisfy the code linter.
Data science often involves implementing complex algorithms, so it is reasonable to expect that there would be functions with many variables involved. They may also have more function parameters (which count towards Pylint’s too-many-locals variable threshold) due to hyperparameters that can be used to adjust the algorithm behaviour. True, perhaps the functions could be broken up into simpler subfunctions, and use configuration objects rather than accepting long lists of parameters, but this would not necessarily improve their readability or ease of use. Furthermore, these warnings tend to be confined to just the core data science modules, which means that software engineers working together with data scientists do not need to be concerned that the surrounding code will be affected.
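The trade-off can be seen in a small, hypothetical example. The two training functions below are invented stand-ins, and CPython's co_nlocals attribute is used only as a rough proxy for how a linter counts local variables (it includes arguments, but Pylint's own counting may differ in detail).

```python
from dataclasses import dataclass


def train_flat(data, learning_rate=0.01, momentum=0.9, batch_size=32,
               epochs=10, dropout=0.5, weight_decay=0.0):
    """Every hyperparameter is a separate argument, and each one counts as a local."""
    return len(data) * epochs


@dataclass
class TrainingConfig:
    """Hypothetical configuration object that bundles the hyperparameters."""
    learning_rate: float = 0.01
    momentum: float = 0.9
    batch_size: int = 32
    epochs: int = 10
    dropout: float = 0.5
    weight_decay: float = 0.0


def train_with_config(data, config: TrainingConfig):
    """Only two arguments, so far fewer names count towards the threshold."""
    return len(data) * config.epochs


# co_nlocals is CPython's count of local variables, arguments included.
flat_locals = train_flat.__code__.co_nlocals
config_locals = train_with_config.__code__.co_nlocals
```

The configuration object reduces the count from seven to two, but callers now have to build a TrainingConfig before calling the function, which is the ease-of-use cost mentioned above.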
As for the invalid-name warnings, looking at a sample of the violations, this seems to be due to code that is based on papers. If the mathematical notation used in the paper defines a variable called K, then that’s how data scientists will implement it in the Python code too, as just K, whether or not it is really a constant in the code. Giving the variable a longer, more meaningful name that conforms with coding standards may help software engineers feel good about the code quality score, but would make it harder for data scientists to see the correspondence between the code and the notation used in papers.
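One possible middle ground, sketched here as an option rather than a recommendation from the study, is to keep the paper's notation and silence the warning locally with Pylint's inline disable comment. The function and the paper it supposedly follows are hypothetical.

```python
def assign_clusters(points, centroids):
    """Assign each 1-D point to its nearest centroid, following the paper's notation."""
    # K mirrors the (hypothetical) paper's symbol for the number of clusters.
    K = len(centroids)  # pylint: disable=invalid-name
    assignments = []
    for x in points:
        distances = [abs(x - centroids[k]) for k in range(K)]
        assignments.append(distances.index(min(distances)))
    return assignments


labels = assign_clusters([0.1, 0.9, 0.4], centroids=[0.0, 1.0])
```

The comment records that the name is deliberate, so the correspondence with the paper is preserved without lowering the standard for the rest of the codebase.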
Code linters, particularly for dynamic languages like Python, also have a tendency to flag a lot of false positives, so they are no substitute for human judgement. In my next post, I’ll cover some of the common pitfalls you need to be aware of when using Pylint to measure code quality.
 Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems, 2503–2511. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
 Dos Santos, R. M., & Gerosa, M. A. (2018). Impacts of coding practices on readability. Proceedings – International Conference on Software Engineering, 277–285. https://doi.org/10.1145/3196321.3196342
 Simmons, A. J., Barnett, S., Rivera-Villicana, J., Bajaj, A., & Vasa, R. (2020). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects. International Symposium on Empirical Software Engineering and Measurement. https://doi.org/10.1145/3382494.3410680