Software Engineering for DS

Source

Software Engineering for Data Scientists, Catherine Nelson, April 2024, O'Reilly Media, Inc.

How to analyze code performance

Find bottlenecks

Handle time complexity

Using Data Structures Effectively

Operation         | List       | Tuple             | Dict        | Set               | NumPy Array | pandas DataFrame
Index Lookup      | Yes (O(1)) | Yes (O(1))        | No          | No                | Yes (O(1))  | Yes (fast)
Key Lookup        | No (O(n))  | No (O(n))         | Yes (O(1))  | Yes (O(1))        | No (O(n))   | Yes (fast)
Search by Value   | No (O(n))  | No (O(n))         | No (O(n))   | Yes (O(1))        | Yes (O(n))  | Yes (O(n))
Insert            | Yes (O(n)) | No                | Yes (O(1))  | Yes (O(1))        | No (O(n))   | No (O(n))
Delete            | No (O(n))  | No                | Yes (O(1))  | Yes (O(1))        | No (O(n))   | No (O(n))
Memory Efficiency | Medium     | High              | Medium-High | Medium-High       | High        | Low
Best Use          | General    | Fixed data        | Fast lookup | Membership        | Numeric ops | Tabular data
Key Feature       | -          | static; immutable | -           | no inherent order | -           | slow if iterating through rows
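
A quick way to see the "Search by Value" row in practice is to time a membership test on a list against the same test on a set. This is only a rough sketch: the names (items_list, items_set, target) and the sizes are made up for illustration, and the exact timings depend on the machine.

```python
import timeit

# Illustrative data: the same 100,000 integers stored as a list and as a set.
n = 100_000
items_list = list(range(n))
items_set = set(items_list)
target = n - 1  # last element, so the list has to scan every item

# `in` on a list walks the items one by one (O(n));
# `in` on a set uses a hash lookup (O(1) on average).
list_time = timeit.timeit(lambda: target in items_list, number=100)
set_time = timeit.timeit(lambda: target in items_set, number=100)

print(f"list: {list_time:.5f}s, set: {set_time:.5f}s")
```

On typical hardware the set lookup should come out far faster, and the gap grows as the collection gets larger.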

Code Formatting, Linting, and Type Checking

Code formatting

Linting

Type Checking

= checking that the inputs a function receives match the types it expects, to catch potential errors early
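
As a minimal sketch of what this looks like in Python, the function below uses type hints. The function name mean_of is hypothetical, and mypy is mentioned only as one example of a static type checker; the notes above do not name a specific tool.

```python
def mean_of(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(mean_of([1.0, 2.0, 3.0]))

# A static checker such as mypy would flag a call like mean_of("123"),
# because a str is not a list[float]. Python itself does not enforce
# type hints at run time, so the checker has to be run separately.
```

The hints also act as documentation: a reader can see what the function expects without opening its body.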

Testing Code

Basic test structure (from Pytest documentation)

  1. arrange: set up everything needed to run the function
  2. act: run the function
  3. assert: check if the result is as expected
  4. cleanup: remove anything the test created so it cannot affect other tests
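
The four steps above can be sketched as a single test. Both count_lines and the file contents are hypothetical, made up for illustration. The test uses a plain assert, so pytest can collect it, but it also runs as an ordinary function.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Hypothetical function under test: count the lines in a text file."""
    with open(path) as f:
        return sum(1 for _ in f)


def test_count_lines():
    # 1. arrange: create a temporary file with known content
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("a\nb\nc\n")
    try:
        # 2. act: run the function
        result = count_lines(path)
        # 3. assert: check the result against the expected value
        assert result == 3
    finally:
        # 4. cleanup: delete the temporary file even if the assert fails
        os.remove(path)
```

In a real pytest suite the arrange and cleanup steps are often moved into fixtures (for example the built-in tmp_path fixture), which keeps the test body down to act and assert.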

Types of Tests

Testing for DS & ML

Design & Refactoring

Code Design

ML project structure

├── README.md 
├── requirements.txt 
│ 
├── notebooks 
│ ├── explore_data.ipynb 
│ └── try_regression_model.ipynb 
│ 
├── src 
│ ├── __init__.py 
│ ├── load_data.py 
│ ├── feature_engineering.py 
│ ├── model_training.py 
│ ├── model_analysis.py 
│ └── utils.py 
│
├── tests 
│ ├── test_load_data.py 
│ ├── test_feature_engineering.py 
│ ├── test_model_training.py 
│ ├── test_model_analysis.py 
│ └── test_utils.py

Refactoring

Refactoring is the process of changing a software system in a way that does not alter the external behavior of the code, yet improves its internal structure. —Martin Fowler

Documentation

Deployment

Containers & Docker

Cloud deployment

Security

What is security for software

Commonly used terms

Security risks & practices

Working in Software

The Software Development Lifecycle

  1. plan
  2. design
  3. build
  4. test
  5. deploy
  6. maintain

Software development

Technical roles in the software industry

My side notes

Jupyter Notebook version control solutions