5% of the 420 Python codebases we checked silently skipped tests (so we fixed them)
Duplicate test names in the same scope result in some of the tests being silently skipped:
# file with 3 tests (with duplicate names)
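The original snippet is not reproduced here, so below is a minimal reconstruction of the kind of file in question, assuming all three tests share the name `test_a` (as the pytest output below suggests); the bodies are illustrative placeholders:

```python
# tests/test_helpers.py — three tests, all accidentally named test_a

def test_a():
    """first version"""
    assert 1 + 1 == 2

def test_a():  # silently shadows the first definition
    """second version"""
    assert "a".upper() == "A"

def test_a():  # shadows again — the only test pytest can collect
    """third version"""
    assert sum([1, 2]) == 3
```

By the time the module has finished importing, the name `test_a` points only at the last definition, so the first two test bodies are unreachable.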
The tests were written. Let's run them:
tests/test_helpers.py::test_a ✓                          100% ██████████

Results (0.85s):
Three tests were written, but only one was found in the file, so only one test was run. The reason is a basic feature of Python (and of most programming languages): when a function, class, or variable is defined multiple times in the same scope, the last definition overwrites all previous ones. This is true of name assignments in general. Here's a diagram of the different scopes a test may be in:
Tests can share a name if they live in different scopes, but the same name in the module scope, or the same name within one class scope, results in the later test overriding the earlier ones. Test runners such as pytest or the built-in unittest package can only collect the tests that actually exist after the file is imported — so if a file contains duplicate test names, the earlier definitions are never visible to the runner and therefore never run.
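To make the scope rules concrete, here is a small illustrative sketch (class and test names are hypothetical):

```python
# Same test name in *different* scopes is fine; same scope collides.

class TestAlpha:
    def test_answer(self):   # lives in TestAlpha's scope
        assert 40 + 2 == 42

class TestBeta:
    def test_answer(self):   # lives in TestBeta's scope — no collision
        assert 6 * 7 == 42

class TestGamma:
    def test_answer(self):
        assert True

    def test_answer(self):   # same class scope — silently shadows the one above
        assert True
```

pytest would collect and run `TestAlpha::test_answer`, `TestBeta::test_answer`, and a single `TestGamma::test_answer`; the first `test_answer` in `TestGamma` never exists by the time collection happens.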
We at Code Review Doctor automatically scanned 420 public open source Python repositories on GitHub to detect tests that were skipped because they had duplicate names within their scope. We checked for "pytest-style" function tests and "TestCase-style" class tests (fun fact: while pytest has a lot of proponents, 28% of Python devs still use unittest TestCase tests). We found that 5% of the 420 repositories had at least one test that was not running due to this issue. The record for the most duplication of a single test name in the same scope goes to "test_service_mgr_runit", which was defined three times — meaning two tests were skipped. Somewhere there is a developer sleeping soundly, erroneously thinking the test they wrote demonstrates the code works as expected!
Here’s a gist of the affected lines we found. We auto-generated simple fix pull requests. Here are the pull requests we opened for the larger, more interesting projects:
- Ansible (52.2k stars, 20k forks)
- OpenEdx (5.9k stars, 3.2k forks)
- Django Axes (1k stars, 500 forks)
- Taigaio (5.7k stars, 1.1k forks)
- Django Haystack (3.3k stars, 1.3k forks)
The sheer number of affected codebases (1 in 20 in our sample) demonstrates that this problem is not just theoretical: it affects real codebases — some of them huge, well-established projects with large user bases, strong coding standards, and a mature software development life-cycle. See the full list on this gist here.
Impact and cause
It’s noteworthy that CI builds started failing on approximately 25% of the pull requests once the skipped tests were renamed. This does not necessarily indicate the code is broken; it could indicate the test logic is wrong. A wrong test is still bad, though, because developers often read tests to understand the expected behaviour of the code. A developer reading the test probably won’t notice it was never being run, and could therefore come away with a false picture of how the product behaves.
The impact of this problem is therefore threefold: maintainers are misinformed about what code is being tested, developers reading the tests are misinformed about how the code is expected to work, and users are potentially shipped degraded functionality because the feature they rely on may not be tested at all.
Linters such as pylint (E0102, function-redefined) or flake8 (F811, redefinition of unused name) can detect this problem, but even that can fail if:
- The linter is accidentally configured to ignore certain files or to turn off certain checks.
- The linter is configured to silence “annoying” messages without anyone realising the full impact of the check being turned off.
- The linter is configured to silence “annoying” lint messages specifically for tests (they’re just tests, after all…).
- CI is accidentally misconfigured and never runs the linter.
- No linter is set up at all. Running one is a best practice, sure, but plenty of projects simply don’t.
Additionally, humans are unlikely to catch this class of error manually during code review: it is easy to miss, and reviewers are probably relying on a linter to catch it anyway. Given enough time, any process involving humans will produce human error, and that should be catered for rather than fought against. This is evident in the fact that we do code review at all: if we review code because we expect human error when code is written by a human, then surely it follows that we should also expect human error when code is read (reviewed) by a human. Any process that relies on zero human error will eventually fail. This is why linting is so useful, and linting misconfiguration so problematic.
Scanning batches of codebases is a very useful part of Code Review Doctor’s own software development life-cycle, because it acts as a quality gate for detecting false positives in our code-scanning rules. When we write a new code scanner rule we stress-test it by running it against hundreds of codebases and checking what issues it reports — in this case, 420 codebases. We then manually identify false positives and fix edge cases before they can annoy the users who rely on Code Review Doctor to review their GitHub pull requests. A happy by-product of this process is that we get to contribute to open source projects by opening PRs that fix the real issues we find.
The simple fix that Code Review Doctor automatically suggests when it encounters duplicate test names is to add a “_two” suffix to the duplicate. A more descriptive name would be better, but naming is hard, and choosing a good name is best left to the developer who knows the context of the test. Having said that, a badly named test that runs is better than a test that does not run. Once the codebase maintainers know there is an issue they can choose a better name if they want, or simply accept the suffix suggestion, since it solves the immediate problem of the test not running. With that in mind, this problem can be prevented by using our GitHub pull request checker, which will suggest the test be renamed whenever the problem is encountered:
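Concretely, the suggested rename looks something like this (illustrative test bodies, not a real project's code):

```python
# Before the fix, the second test_a silently shadowed the first.
# After the "_two" suffix is applied, both tests exist and both run.

def test_a():
    assert 1 + 1 == 2

def test_a_two():  # previously a shadowing duplicate also named test_a
    assert "a".upper() == "A"
```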
You can also check if this problem is in your codebase already by scanning it securely for free at CodeReview.doctor