Write clean code with the 5 SOLID Principles

As a data scientist, you have enough on your plate and shouldn’t be obliged to clean your code on top of that, right? Maybe you shouldn’t, but you and the whole data science project would benefit from a clean codebase. Let me show you 5 simple principles that help you produce it.
by Anel Music

It is true: as a data scientist, you perform a huge pile of different tasks: handling messy datasets, making sense of the data by asking the right questions, evaluating which algorithms and statistical methods might be suited to model that data, conducting experiments and finally, communicating with engineers, domain experts as well as managers and clients. So at least writing clean code and “productionalizing” your Jupyter Notebooks might be taken care of by someone else – like software engineers and machine learning (ML) engineers – right?
Well, yes and no. Of course, ML engineers are responsible for bringing the model into production. However, there are at least 2 good reasons why you should still improve your coding over time:
- As a data scientist, you might aspire to become an ML engineer yourself one day – and writing clean code can be seen as a preparation for this step.
- Even if that is not the case, clean code, which does not have to be substantially rewritten or refactored first, shortens the model deployment time and feedback loop. This, in the end, benefits all, as it leads to faster iteration, faster deployment, faster improvement, and faster customer satisfaction.
Now that we have established that it makes sense to improve your coding, how do we even define clean code?
In a nutshell: What is clean code?
Put simply, clean code is easy to read, easy to use, easy to extend and easy to test.
There are 5 software design principles called the SOLID Principles that help you write such code. At first, you might have to force yourself to comply with these principles, but once you have internalized them, you will implement them without really thinking about it. So, let’s go through each of the 5 SOLID Principles and see what they entail.
SOLID Principle #1: Single Responsibility Principle
The idea: Your class should have only one job.
What it means: You might have already heard of the so-called God object. The God object is the instance of a class that can do virtually anything. In the data science context, this might be a class that reads data, performs preprocessing, trains and evaluates models, makes new predictions and possibly even handles postprocessing. Although this might sound like a very potent and amazing class, concentrating the responsibilities of many completely different tasks in one class has many disadvantages. For one, the class becomes very large, and it becomes difficult to keep track of the effects of changes. Fortunately, refactoring a God class is often straightforward.
Example: Let’s assume that you have a classifier class as shown in Figure 1. The classifier has 2 member variables (name and performance) and 2 methods for predicting and updating a simple model performance dashboard. I think we can all agree that a classifier should not be responsible for updating the dashboard.
Figure 1: Single Responsibility Principle violated.
The solution: We can resolve the violation of the Single Responsibility Principle fairly easily by delegating the dashboard responsibility to a separate dashboard class (see Figure 2). To do so, we simply introduce a new class called “Dashboard” and use its update method to update the dashboard. Due to the strictly separated responsibilities, our classes become much shorter as well as easier to explain and understand. In fact, this is one of the reasons why micro-services have become such a popular architecture.
Figure 2: Single Responsibility Principle reestablished.
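Since the figures are not reproduced here, the refactored design can be sketched in Python roughly as follows. This is a minimal sketch and not the code from the figure: the in-memory entries dictionary and the toy prediction rule are assumptions made purely for illustration.

```python
class Dashboard:
    """Knows how to display model performance, and nothing else."""

    def __init__(self) -> None:
        # Hypothetical in-memory store standing in for a real dashboard.
        self.entries = {}

    def update(self, name: str, performance: float) -> None:
        self.entries[name] = performance


class Classifier:
    """Knows how to predict, and nothing else."""

    def __init__(self, name: str, performance: float) -> None:
        self.name = name
        self.performance = performance

    def predict(self, features):
        # Placeholder rule standing in for a trained model.
        return 1 if sum(features) > 0 else 0


clf = Classifier("baseline", 0.87)
dashboard = Dashboard()
# The dashboard is updated from the outside; the classifier is not involved.
dashboard.update(clf.name, clf.performance)
print(dashboard.entries)
```

If the dashboard technology changes later, only the Dashboard class has to be touched, and the Classifier can be tested without any dashboard at all.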
SOLID Principle #2: Open Closed Principle
The idea: Your class should be open for extension but closed for modification.
What it means: All code is open for extension in the sense that you can always add new features. Ideally, however, you should be able to add these features without changing the existing code in any way. Changing existing code not only carries the danger of introducing new bugs but might also require you to extend already existing unit tests. This can be difficult – especially if you don’t fully understand what the function you’ve extended does.
Example: Imagine you are presented with the store_data method shown in Figure 3. Depending on the storage_type, this function either stores data in a SQL database or a CSV file. You want to add a new feature that allows you to store the data to a MongoDB database. The simplest way to do this would be to add another if-condition to the store_data method, check for (storage_type == “mongodb”) and, if satisfied, execute some //store to mongodb code.
This would work perfectly fine; however, it would violate the Open Closed Principle since adding your new feature (extension) would require the change of already existing code (modification).
Figure 3: Open Closed Principle violated.
The solution: Figure 4 shows one way to fix the Open Closed Principle violation. As you can see, the DatasetManager no longer has a store_data method but a member storer of type DataStorer – an interface that defines what different types of DataStorers have in common. In this case, all DataStorers have a store_data() method.
If you want to write your data into an SQL database, you can create a class SQLStorer that inherits from (that is, implements) the DataStorer interface and provides the store_data() method.
If you want to add a new feature for writing the data into a CSV file, you don’t need to change existing code. You can create a new class CSVStorer that also inherits from DataStorer and defines the store_data() method. Similarly, adding a new MongoDB feature only requires you to again extend the code by providing a new MongoDBStorer class without modifying existing classes or functions.
You’ve probably realized that the DataStorer interface provides a common “template” for all types of DataStorers (SQL, CSV, MongoDB, S3, BlobStorage, etc.). As our DatasetManager depends on a generic storer object of type DataStorer, you can pass any DataStorer subclass object to it. This has the advantage that if you change the way you store your data (e.g., from CSV to S3), you can simply pass an S3Storer object instead of a CSVStorer object to your DatasetManager constructor without breaking the client code.
In general, your classes should always depend on abstractions (interfaces) and not on implementations (concrete classes).
Figure 4: Open Closed Principle reestablished.
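In Python, an interface like DataStorer is usually written as an abstract base class. The following is a minimal sketch under a few assumptions: the storers only return a description instead of talking to a real database or file system, and the save() method on DatasetManager is a hypothetical name that does not appear in the figure.

```python
from abc import ABC, abstractmethod


class DataStorer(ABC):
    """Interface: what all storers have in common."""

    @abstractmethod
    def store_data(self, data):
        """Persist the data and return a short description of what was done."""


class SQLStorer(DataStorer):
    def store_data(self, data):
        # A real implementation would issue INSERT statements here.
        return f"stored {len(data)} rows in SQL"


class CSVStorer(DataStorer):
    def store_data(self, data):
        # A real implementation would write a CSV file here.
        return f"stored {len(data)} rows in CSV"


class MongoDBStorer(DataStorer):
    # The new feature: added without touching any class above.
    def store_data(self, data):
        return f"stored {len(data)} rows in MongoDB"


class DatasetManager:
    def __init__(self, storer: DataStorer) -> None:
        # Depends on the abstraction, not on a concrete storer.
        self.storer = storer

    def save(self, data):  # hypothetical method name
        return self.storer.store_data(data)


rows = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
for storer in (SQLStorer(), CSVStorer(), MongoDBStorer()):
    print(DatasetManager(storer).save(rows))
```

Adding MongoDBStorer required no change to DatasetManager, SQLStorer or CSVStorer, which is exactly what “closed for modification” asks for.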
SOLID Principle #3: Liskov Substitution Principle
The idea: You should be able to replace a parent class object by any child class object without altering the correctness of your code.
What it means: If you are somewhat puzzled by this formal definition of the Liskov Substitution Principle, don’t worry – you are not the only one.
Despite what it might look like at first glance, the Liskov Substitution Principle is actually rather straightforward, as it works with concrete checkpoints. Therefore, two programmers will never disagree on whether the principle is violated or not (which can happen, e.g., regarding the Single Responsibility Principle). This also means that static type checkers such as mypy can help you identify some of these violations, for example a child class method whose signature is incompatible with the parent’s. However, to resolve them, it’s still important to understand the idea behind the principle.
In my opinion, a simple before and after illustration would not suffice here. Thus, I will go into a little more detail.
Example: Let’s say you want to be able to pay for an order you made. A simple implementation is shown in Figure 5. Here, you have an ApplePay class that is responsible for processing the payment. To do so, the ApplePay class has a pay() method that receives the order and a phone number, which is verified using some sort of verification procedure within the pay() method.
Concluding this procedure sets the order status to ‘paid’. The PaymentProcessor interface serves as “template” and should be implemented by all kinds of different PaymentProcessors.
On the left side, you can see the fairly simple client code. First, you instantiate the order class and add a keyboard to your order. Then, you instantiate the ApplePay class and call the pay() method using your order and phone number.
Figure 5: Classes without Liskov Substitution Principle violation.
Now, let’s assume you want to add a new feature that allows payment via PayPal. You can implement your PaymentProcessor interface and create a new class called PayPalPay as shown in Figure 6. For PayPalPay, you would implement some sort of phone number verification procedure in the pay() method and set the order.status to ‘paid’. The client code hardly changes. So far, nothing new and also no Liskov violation.
Figure 6: Classes without Liskov Substitution Principle violation.
Unfortunately, PayPal doesn’t work with phone number verification. Instead, it uses an email address to verify an account as illustrated in Figure 7:
Figure 7: Classes without Liskov Substitution Principle violation.
A quick remedy for this challenge is shown in Figure 8. Instead of passing the phone number in the client code payer.pay(order, ‘+491520000’) call, you could simply pass an email address, payer.pay(order, ‘abc@def.com’), and – instead of a phone number verification procedure – implement an email address verification procedure. You only have to remember that the parameter phone_nr does not hold a phone number but rather an email address. Sadly, you can’t change the parameter name from phone_nr to email_addr because you’re adhering to the PaymentProcessor interface.
Figure 8: Liskov Substitution Principle violated.
This would definitely work, produce the output you expect and it shouldn’t cause any problems, if …
… you remember that the phone_nr in the pay() method of the PayPalPay class is an email address,
… you misuse this parameter phone_nr for your purposes in the email verification procedure by treating it like an email address,
… you remember to pass an email address instead of a phone number to the pay() method in the client code when using the PayPalPay class, and
… no one ever by accident passes a phone number to the pay() method of a PayPalPay object which would cause an error in the email verification procedure of the pay() method in the PayPalPay class.
Way too many “ifs”, if you ask me.
As you can see, violating the Liskov Substitution Principle even for this simple example results in a variety of problems. These basically occurred because your child class objects could not be used interchangeably. To be more precise: You can’t exchange the payer objects in the two client code snippets above because the way the pay() method is called depends on which class (ApplePay or PayPalPay) you instantiate.
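To make the problem tangible, here is a minimal sketch of the violating design. The Order class, the toy verification checks and the checkout() helper are assumptions made for illustration; they are not taken from the figures.

```python
from abc import ABC, abstractmethod


class Order:
    def __init__(self) -> None:
        self.items = []
        self.status = "open"

    def add_item(self, item: str) -> None:
        self.items.append(item)


class PaymentProcessor(ABC):
    @abstractmethod
    def pay(self, order: Order, phone_nr: str) -> None: ...


class ApplePay(PaymentProcessor):
    def pay(self, order, phone_nr):
        if not phone_nr.startswith("+"):  # toy verification
            raise ValueError("not a phone number")
        order.status = "paid"


class PayPalPay(PaymentProcessor):
    def pay(self, order, phone_nr):
        # phone_nr is misused: it is expected to hold an email address.
        if "@" not in phone_nr:  # toy verification
            raise ValueError("not an email address")
        order.status = "paid"


def checkout(payer: PaymentProcessor, order: Order) -> None:
    # Client code written against the interface passes a phone number.
    payer.pay(order, "+491520000")


apple_order, paypal_order = Order(), Order()
checkout(ApplePay(), apple_order)  # works as expected
try:
    checkout(PayPalPay(), paypal_order)
    paypal_failed = False
except ValueError:
    paypal_failed = True  # swapping in the other subclass broke the client
print(apple_order.status, paypal_order.status, paypal_failed)
```

Note that both subclasses have exactly the same method signature, so a type checker would not complain here. The violation lies in what each class silently expects the second argument to contain.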
Figure 9: Liskov Substitution Principle reestablished.
The solution: Figure 9 illustrates how to resolve the violation. To no longer misuse the phone_nr parameter as an email address, remove it from the pay() method in the PaymentProcessor interface. This way, irrespective of the class you instantiate, each client code call of the payer.pay(order) method will look exactly the same because a second parameter (phone_nr/email_addr) is no longer required. Whether the verification uses an email address or a phone number is now determined by the constructor of the respective class.
ApplePay now has a member phone_nr, and PayPalPay has a member email_addr. The verification procedure inside the pay() method accesses the corresponding member (phone_nr/email_addr).
You’re no longer violating the Liskov Substitution Principle, and therefore, you can swap the payer object created using the ApplePay class for the payer object created using the PayPalPay class if needed. Now, you can call the pay(order) method on any payer object and won’t make mistakes that might lead to errors and crashes.
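The refactored design could look roughly like this in Python. Again, this is only a sketch: the Order class, the toy verification checks and the checkout() helper are assumptions for illustration, while the member names phone_nr and email_addr follow the text.

```python
from abc import ABC, abstractmethod


class Order:
    def __init__(self) -> None:
        self.items = []
        self.status = "open"

    def add_item(self, item: str) -> None:
        self.items.append(item)


class PaymentProcessor(ABC):
    @abstractmethod
    def pay(self, order: Order) -> None: ...


class ApplePay(PaymentProcessor):
    def __init__(self, phone_nr: str) -> None:
        self.phone_nr = phone_nr

    def pay(self, order):
        if not self.phone_nr.startswith("+"):  # toy verification
            raise ValueError("not a phone number")
        order.status = "paid"


class PayPalPay(PaymentProcessor):
    def __init__(self, email_addr: str) -> None:
        self.email_addr = email_addr

    def pay(self, order):
        if "@" not in self.email_addr:  # toy verification
            raise ValueError("not an email address")
        order.status = "paid"


def checkout(payer: PaymentProcessor, order: Order) -> None:
    # Identical call for every payer; no knowledge of the subclass needed.
    payer.pay(order)


orders = []
for payer in (ApplePay("+491520000"), PayPalPay("abc@def.com")):
    order = Order()
    order.add_item("keyboard")
    checkout(payer, order)
    orders.append(order)
print([o.status for o in orders])
```

The class-specific detail (phone number or email address) now lives in the constructor, which is not part of the PaymentProcessor interface, so the two payer objects are interchangeable wherever pay(order) is called.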
SOLID Principle #4: Interface Segregation Principle
The idea: It’s better to have multiple specific interfaces instead of one big general interface.
What it means: As with the aforementioned God object – at first, having one large interface that declares all the methods that subclasses might want to implement sounds practical, but if you think about it, it is actually the complete opposite.
Example: Figure 10 shows what happens if your interfaces are too general: The ImgSegmenter interface provides a common “template” for all ImgSegmenter subclasses. If you want to create a concrete class that inherits from ImgSegmenter, you need to provide an implementation for all abstract methods (segment_semantics, segment_instances) declared in the ImgSegmenter interface. Otherwise, the compiler (or interpreter) will throw an error when you try to create an object.
This is the case with DeepLab, for example: DeepLab is a semantic segmentation algorithm. Thus, you can only provide a meaningful implementation for the segment_semantics() method. However, inheriting from the interface forces you to provide an implementation for the segment_instances() method too. As a workaround, you can use a Python pass statement or (a bit better) raise an exception to indicate that DeepLab can only be used for semantic segmentation. (In contrast, the MaskRCNNSuper algorithm can perform instance segmentation and semantic segmentation, so you don’t have to raise any exceptions.)
Figure 10: Interface Segregation Principle violated.
The solution: A better design choice is illustrated in Figure 11. After refactoring, we have two specific interfaces (ImgSegmenter and InstanceSegmenter). Both can be implemented fully by their respective subclasses (DeepLab and MaskRCNNSuper). There is no need to raise exceptions. MaskRCNNSuper can be used for semantic segmentation and instance segmentation and therefore implements the InstanceSegmenter interface, which declares the segment_instances() method and inherits the segment_semantics() method from its parent interface ImgSegmenter. DeepLab, in contrast, works only for semantic segmentation and thus only implements the ImgSegmenter interface.
As both concrete classes (MaskRCNNSuper and DeepLab) share the superclass ImgSegmenter, you can, thanks to polymorphism, pass objects of either class to the constructor of your modelling class. I’d like to emphasize once again that your classes should always depend on abstractions (interfaces) and not on implementations (concrete classes).
Figure 11: Interface Segregation Principle reestablished.
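A minimal Python sketch of the segregated interfaces might look like this. The return values are placeholder strings rather than real segmentation masks, and the run_semantic_segmentation() helper is a hypothetical stand-in for the modelling class mentioned above.

```python
from abc import ABC, abstractmethod


class ImgSegmenter(ABC):
    """Interface for everything that can do semantic segmentation."""

    @abstractmethod
    def segment_semantics(self, image): ...


class InstanceSegmenter(ImgSegmenter):
    """Adds instance segmentation on top of semantic segmentation."""

    @abstractmethod
    def segment_instances(self, image): ...


class DeepLab(ImgSegmenter):
    # Only has to implement what it can actually do.
    def segment_semantics(self, image):
        return f"DeepLab semantic mask for {image}"


class MaskRCNNSuper(InstanceSegmenter):
    def segment_semantics(self, image):
        return f"MaskRCNNSuper semantic mask for {image}"

    def segment_instances(self, image):
        return f"MaskRCNNSuper instance masks for {image}"


def run_semantic_segmentation(segmenter: ImgSegmenter, image):
    # Hypothetical client: accepts any ImgSegmenter, including subinterfaces.
    return segmenter.segment_semantics(image)


for segmenter in (DeepLab(), MaskRCNNSuper()):
    print(run_semantic_segmentation(segmenter, "street.png"))
```

DeepLab no longer carries a segment_instances() method that can only raise an exception, and code that needs instance segmentation can ask for an InstanceSegmenter explicitly.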
SOLID Principle #5: Dependency Inversion Principle
The idea: Classes should depend on abstractions and not on concrete classes.
What it means: I have already stated it a few times: classes should always depend on abstractions and never on a concrete implementation. Let’s try to understand what is really behind this principle.
Example: Figure 12 shows a modelling class that is directly dependent on a DeepNN class because it has a member algorithm of type DeepNN. Inside its fit_data() method it calls the fit_deepNN() method of the DeepNN class.
Figure 12: Dependency Inversion Principle violated.
Let’s say the requirements have changed (e.g., less powerful hardware than expected is available), and you need to use a much faster model such as logistic regression. For this, you can create a new LogReg class as shown in Figure 13. If you now pass an object of type LogReg to the constructor of the modelling class, the code will break because the modelling constructor expects an algorithm of type DeepNN. Also, its fit_data() method calls fit_deepNN(), which is only available in the DeepNN class and not in the LogReg class.
Figure 13: Dependency Inversion Principle violated.
To fix the problem, you can change the modelling constructor so that it expects an algorithm of type LogReg as shown in Figure 14 (which, by the way, violates the Open Closed Principle). In addition, you need to change the implementation of the fit_data() method to call algorithm.fit_LogReg() instead of algorithm.fit_deepNN().
Figure 14: Dependency Inversion Principle violated.
You might think that this small change is not a big deal, but try to think about what would happen if the requirements changed once again – if, for example, the number of features within the dataset increased, whereas the number of observations decreased. In this scenario, an SVM classifier might be better suited. You would again have to change the modelling class constructor and make its fit_data() method call the fit_svm() method of a new SVM class.
The problem here is that your modelling class depends on a concrete implementation. This means every time you want to pass a different class type (DeepNN, LogReg, SVM) to the constructor, you have to change the modelling class. Wouldn’t it be great to be able to pass objects to the constructor of the modelling class without changing the type it expects over and over again? We can achieve this quite easily by inverting its dependency.
The solution: The refactored solution in Figure 15 inverts the dependency by making modelling dependent on an abstraction (an interface) instead of a concrete class (an implementation).
This way, the constructor expects an object of type model instead of DeepNN or LogReg. DeepNN and LogReg are now concrete subclasses that implement the model interface and create objects that can be passed on to modelling. As both concrete classes (DeepNN and LogReg) follow the interface model, both must implement a fit() method. This also means that irrespective of the type of the object passed to the modelling constructor, what’s inside the fit_data(data) method remains unchanged (algorithm.fit()).
Figure 15: Dependency Inversion Principle reestablished.
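In Python, the inverted dependency can be sketched as follows. The class names are capitalized here (Model, Modelling) to follow Python naming conventions, which is my choice and not necessarily what the figure shows, and the fit() methods only return a description instead of training anything.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """The abstraction that Modelling depends on."""

    @abstractmethod
    def fit(self, data): ...


class DeepNN(Model):
    def fit(self, data):
        # A real implementation would train a neural network here.
        return f"DeepNN fitted on {len(data)} samples"


class LogReg(Model):
    def fit(self, data):
        # A real implementation would fit a logistic regression here.
        return f"LogReg fitted on {len(data)} samples"


class Modelling:
    def __init__(self, algorithm: Model) -> None:
        # Knows only the Model interface, never a concrete class.
        self.algorithm = algorithm

    def fit_data(self, data):
        return self.algorithm.fit(data)


data = [1, 2, 3]
for algorithm in (DeepNN(), LogReg()):
    print(Modelling(algorithm).fit_data(data))
```

Switching to an SVM later would only mean adding a new class that implements fit(); the Modelling class itself stays untouched.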
Conclusion
In order to write clean code, there are 5 principles you can adhere to (the so-called SOLID Principles):
- Single Responsibility Principle: Your class should have only one job.
- Open Closed Principle: Your class should be open for extension but closed for modification.
- Liskov Substitution Principle: You should be able to replace a parent class object by any child class object without altering the correctness of your code.
- Interface Segregation Principle: It’s better to have multiple specific interfaces instead of one big general interface.
- Dependency Inversion Principle: Classes should depend on abstractions and not on concrete classes.
Although these principles may seem intimidating at first, they quickly become second nature. They make your code modular, more readable, easier to understand, and easier to test. This, in turn, benefits everyone in a data science project, as model deployment time and feedback loops get shorter, leading to faster iteration, faster deployment, faster improvement, and faster customer satisfaction.
