7 Tips To Structure Your Python Data Science Projects
- November 4, 2023
- Posted by: MainInstructor
- Category: BASIC, Data Science, Go, Python, SQL
Today I’ll cover 7 tips to help you better structure your data science projects. If you’re doing a project where you need to analyse some data, it often doesn’t come with a clear objective at the beginning: as you get more insights from your data, you’re going to add or remove features. So you might have a tendency to think of your code as throwaway, and that you don’t need to spend any time designing the software, setting up the project properly, or writing clean code. But that would be a mistake, because the whole idea of properly setting up your project and thinking about software design is that it allows you to make changes more quickly and more easily. And especially with data science projects, you need to change your code regularly, and you need to be able to do it quickly.
So after watching this video, you have no more excuses.

The first tip is to use a common structure for your various projects. You’re probably going to be involved in several projects that deal with data, and you probably also need to switch between them regularly, so it really helps to follow the same project structure everywhere to minimise your cost of switching context. That goes double if you’re part of a team and want to share code with a colleague: if you can somehow agree among the team that you’re going to use the same kind of structure everywhere, it’s just going to make life so much easier. I’ve found it’s really helpful to spend a bit of time with your team members to make sure you agree on what a standard data science project should look like. And if you can’t agree, fight to the death; it’s the only reasonable option. A very useful tool for this is cookiecutter, which lets you start a project from a specific template, so you always have the exact same starting point. You can use an existing template like cookiecutter-data-science, or you can create your own that follows the standards of your team.
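As a rough illustration (not from the video): cookiecutter is normally driven from the command line, but it also ships a Python API. A minimal sketch, assuming the cookiecutter package is installed; the template URL and the project_name field come from the cookiecutter-data-science template and may change over time:

```python
# Minimal sketch: start a project from the cookiecutter-data-science
# template (pip install cookiecutter first).
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,  # accept the template's defaults non-interactively
    extra_context={"project_name": "customer-churn-analysis"},
)
```

Equivalently, you can just run cookiecutter with that URL in a terminal and answer the prompts interactively.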
The second tip is to use existing libraries wherever possible. You may be tempted to write your own code to process, clean, and transform your data because it seems quicker, but know that the more code you write, the bigger the chance of introducing bugs. It’s really interesting to me, looking back over my software development career over the past years, that the more practical experience I gained developing software, the more I started relying on existing libraries. And that’s not just because I’m lazy (I am lazy), but also because existing libraries and packages have put real thought into how to organise everything so that the package solves as many problems as possible. That’s really nice, because it means that in the future, when you extend your project and add new features, you’re probably in a better position than if you’d developed everything yourself from scratch. Another reason to use existing packages as much as possible is that they have (probably, hopefully) been tested properly, so you spend less time testing your own code, because the existing libraries have already solved that for you. Something else I really like about using existing libraries as much as possible is that it lets me learn a lot about the domain. By using a package like pandas, for example, you learn what a data frame is, how processing typically works, and what kinds of standards you should follow, because the makers of pandas have created all sorts of things to help you with that. So for me, it’s also a learning experience. Of course, there are tons of helpful libraries: pandas, but also NumPy, scikit-learn, PyTorch, and SQLAlchemy, lots of libraries that are useful for data science projects. Use them! One particular type of tool that’s very helpful is a pipeline or workflow tool. This lets you structure your data processing as workflows, and it makes everything way more scalable.
This is where, for example, a platform like Taipy comes in (they’re also currently sponsoring this video). Taipy handles both front end and back end, it’s open source, and you can use it for free. To install it, simply type pip install taipy, or if you’re using Poetry, type poetry add taipy to add it to your pyproject.toml file. You can use Taipy in VS Code directly via the Taipy Studio extension, and it also works in Jupyter Notebooks. The nice thing about using a tool that’s dedicated to running pipelines is that it offers a lot of features that you don’t have to build yourself. For example, Taipy has scenarios, which act as a sort of registry for all your pipeline runs. Scenarios also double as a great comparison tool for what-if analysis, because they allow you to launch pipeline runs using different parameters. This helps you take a project that you set up as a pilot with just a simple machine learning model and make it available to users with a much higher-quality model very easily. On top of that, Taipy has many other features like parallelism, caching, data scoping, and pipeline versioning. Go to Taipy’s GitHub page to check it out; I’ve also put the link in the description.
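To make the idea concrete, here is a rough, hypothetical sketch of what a small Taipy scenario can look like, based on the Taipy Core API around version 3; the node, task, and function names are made up, so check Taipy’s documentation for the exact calls:

```python
# Hypothetical sketch of a Taipy scenario (Taipy Core API, ~version 3);
# names like "raw_data" and clean() are illustrative only.
import pandas as pd
import taipy as tp
from taipy import Config

def clean(raw_df: pd.DataFrame) -> pd.DataFrame:
    return raw_df.dropna()  # stand-in for a real processing step

# Declare the data nodes and the task that connects them.
raw_cfg = Config.configure_data_node(id="raw_data")
clean_cfg = Config.configure_data_node(id="clean_data")
task_cfg = Config.configure_task(
    id="clean_task", function=clean, input=raw_cfg, output=clean_cfg
)
scenario_cfg = Config.configure_scenario(id="pipeline", task_configs=[task_cfg])

if __name__ == "__main__":
    tp.Core().run()  # start the Taipy Core service
    scenario = tp.create_scenario(scenario_cfg)
    scenario.raw_data.write(pd.DataFrame({"x": [1.0, None, 3.0]}))
    tp.submit(scenario)
    print(scenario.clean_data.read())
```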
Now, back to the video. The third tip is to make sure you log your results. As you’re analysing data and iterating, it’s really important that you log things, because if something goes wrong, you want to be able to see where that happened and then go back a few steps to fix the problem. If you don’t log things, that’s actually really hard to do. Tools like Taipy can help you log your pipeline runs, but you might also want to keep track of the various outputs. There are several ways to do this. You can use log files, which are just files you store locally on your machine, but they’re a pretty rudimentary solution. There are also log services like Papertrail that let you send logs over the Internet to a cloud service, though you might not want your logs stored on a server you don’t own; this is actually one of the major ways in which data leaks, so be very careful with that. If you’re doing machine learning specifically, there are also tools like Comet ML that let you keep track of your experiments and visualise the performance of your models.
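As a minimal sketch of the local log file option, Python’s standard library logging module already gets you quite far (the file name and messages here are just examples):

```python
import logging

# Write logs both to a local file and to the console; a cloud log
# service could later be attached as just another handler.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.FileHandler("pipeline.log"), logging.StreamHandler()],
)
logger = logging.getLogger(__name__)

logger.info("Loaded %d rows from the raw dataset", 10_000)
logger.warning("Dropped %d rows with missing values", 42)
```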
The fourth tip is to not be afraid to use intermediate data representations. You don’t have to do all your data processing in a single step: you can first do some preprocessing, store the result in an intermediate file or even in a database, and then take that to the next step. And not just for data science projects, by the way; this is something that has really helped me a lot in structuring and organising my code better, because it allows you to focus on one particular part of the job. As a first step, you can do the preprocessing and make sure the data is stored in a format you can work with at a later stage. By doing that, you force yourself to think about just the preprocessing part, instead of the whole processing pipeline that you’re building out step by step. The reason this works well is that different representations are optimised for different things. For example, data in CSV or JSON format is really great because it’s human-readable and lightweight, and you can send it to other people without worrying about them not being able to read it. But if you need to query the data, a CSV or JSON file is less convenient, because there’s no easy search functionality apart from basic text search. In that case, storing the data in a SQL database is a better option, though that has its own problems, because then you have to deal with the added complexity of managing a database. Data frames, in turn, are great for exploratory data analysis because they have a really extensive API, but you may run into RAM limitations from keeping everything in memory, they can be slower for specialised operations like database operations, and there’s a learning curve. Still, if you do a lot of data analysis, knowing about data frames is a really good skill to have. So pick whatever is most suitable for the job. If you need multiple formats at different steps, that’s not a problem at all; it’s better to convert the data than to try to work your code around a format that doesn’t work for you.
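Here’s a minimal sketch of that idea with pandas; the file and column names are hypothetical, and to_parquet needs pyarrow or fastparquet installed:

```python
import sqlite3

import pandas as pd

# Step 1: preprocess once and store an intermediate representation.
raw = pd.read_csv("raw_events.csv")        # hypothetical input file
clean = raw.dropna()
clean.to_parquet("clean_events.parquet")   # compact, fast to reload

# Step 2, possibly in a separate script: load the intermediate file
# and push it into SQLite when you need real query capabilities.
clean = pd.read_parquet("clean_events.parquet")
with sqlite3.connect("events.db") as conn:
    clean.to_sql("events", conn, if_exists="replace", index=False)
    sample = pd.read_sql("SELECT * FROM events LIMIT 5", conn)
```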
So here’s a question: what kinds of data formats do you use, and why? If you have any tips, share them in the comment section.

Tip number five is to move code that you’re planning to reuse into a shared package. Especially if you use Jupyter Notebooks, it’s really easy to lose the overview once your code starts getting more complex: you might accidentally break things in a notebook, and there’s no easy way to reuse code between notebooks. If you take the parts of the code that you want to reuse, put them in separate modules, and import those in your notebook, it’s actually way easier to manage. You could even create a package out of that, publish it, and then import the code that way. Working with Python code that’s not in a notebook also has other advantages, like being able to easily write unit tests for it, automatic formatting, style fixes, et cetera; all those things that you don’t have in a Jupyter Notebook. Now, I’m not an avid Jupyter Notebook user at all, so my main experience is with simply writing Python code, and I do notice that whenever I use a notebook, I always have the tendency to quickly get out of there and back into regular code, where I feel way more comfortable. But notebooks are pretty nice for exploratory analysis, so it’s good to integrate them into your workflow, in a meaningful, careful way.
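For example (a sketch; the module and function names are made up), the reusable part moves out of the notebook into a regular module:

```python
# my_project/preprocessing.py: reusable code lives in a plain module,
# where it can be unit-tested, formatted, and imported from anywhere.
import pandas as pd


def drop_outliers(df: pd.DataFrame, column: str, z: float = 3.0) -> pd.DataFrame:
    """Remove rows more than `z` standard deviations from the column mean."""
    deviation = (df[column] - df[column].mean()).abs()
    return df[deviation <= z * df[column].std()]
```

A notebook then only needs from my_project.preprocessing import drop_outliers, and several notebooks can share the same tested implementation.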
Tip number six is to move configuration settings into a separate file. You really want to keep configuration settings separate from the code; the worst thing you can do is have the settings spread out all over your codebase. And that’s really easy to do, right? You have your different modules, and in each module you just use some constants that you define all over the place. This makes it really hard to make changes later on. If you need to deploy the code and change the settings, because you want to connect to a different database or you need different paths or folder names, and the configuration settings are all over the place in your code, they’re really hard to find, and it’s going to take you a ton of time. So the best thing you can do is move all of those settings to at least one single place, preferably outside of the code, which I think is the best solution. But if you want to define things in code, make sure that happens in a single place. It’s the idea that when you organise your code, there should be a single “dirty” place where you do all the patching up and define all the specific things your code needs. If everything’s in one place, it’s easy to switch to different values, different constants, different ways of setting things up. What I typically do is store everything in environment variables. I might work with a local .env file so that I can easily define those variables whenever I’m working on my local machine. The advantage of environment variables is that they’re also well integrated with cloud tools: if, for example, you deploy a function or a Docker container to the cloud, it’s pretty easy to define a few environment variables to change the settings that your application should use.
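A minimal sketch of that setup, assuming the python-dotenv package and made-up variable names:

```python
# settings.py: a single place for configuration, read from the
# environment; python-dotenv loads a local .env file in development.
import os

from dotenv import load_dotenv

load_dotenv()  # harmless in production if no .env file exists

DATABASE_URL = os.environ["DATABASE_URL"]   # required; fails fast if missing
DATA_DIR = os.getenv("DATA_DIR", "./data")  # optional, with a default
DEBUG = os.getenv("DEBUG", "false").lower() == "true"
```

The rest of the codebase then imports from settings instead of defining constants locally, so deploying with different settings only means setting different environment variables.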
And by the way, pipeline tools like Taipy also support running your code with different configuration settings. Now, if you want to get better at reviewing your code and detecting problems faster, so that you’re able to make changes like cleaning up your config settings, check out my free code diagnosis workshop, where I teach you a three-factor framework for reviewing code efficiently while still identifying the main problems. You can sign up at arjan.codes/diagnosis; it contains a lot of useful advice and practical code examples you can apply right away to your own projects. The link is also in the description of the video.

The seventh and final tip is to actually write unit tests. If you think you don’t need unit tests because, well, you can just take a look at the charts: think again.
The problem with not writing unit tests is that you have a much higher chance of running into problems with your code later on, for example when you need to run your code on a new dataset, which happens regularly, right? That case is really problematic, because it means that when you run your code on a new set of data, at a point when you’re actually trying to focus on something else, you’re going to notice that there is a bug in your program. And that will probably also be the time when you’re on a deadline, you’re in a hurry, and you need to perform that analysis quickly. You’ve maybe also been out of the code for several weeks or even months, so it’s going to take you a lot of time to get back into it and fix the bug. If you write unit tests while you’re developing the code that they test, it’s actually much easier, because that’s the moment when you’re focusing on the code. That’s the moment when you can spend a bit of time writing unit tests and making sure everything is stable, so that in the future, if you switch out the dataset, or if you share your code with a colleague who uses it in a slightly different way, at least you have the tests in place to catch part of the issues. What you really want to avoid is that things are too dependent on you being involved every step of the way; the whole idea of writing code is that it automates things for you. But if you don’t write tests, bugs are going to pop up at the most inconvenient moment possible, and they’re going to need you to fix them. If you can do part of that work up front, it’s going to save you a lot of trouble in the future. Another reason to write tests is that even though you may think you can detect issues by just looking at the charts, some problems might be too small to show up in a chart but still affect the result, and therefore affect the decisions you’re going to take based on your analysis. So it’s always good to think of your code a bit more broadly than just what shows up on the charts, and make sure things are robust and stable, especially if you move some of your code to a separate package; that code is a great candidate to write unit tests for. You can even go full-on test-driven development and write the tests before you write the code, though for data science projects that’s perhaps taking it a few steps too far.
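As a small sketch of what this can look like with pytest, testing the hypothetical drop_outliers helper from tip five:

```python
# test_preprocessing.py: run with "pytest".
import pandas as pd

from my_project.preprocessing import drop_outliers


def test_extreme_value_is_removed():
    df = pd.DataFrame({"price": [10.0, 11.0, 9.0, 10.5, 1_000.0]})
    # With few rows, the outlier inflates the std, so use a tight z.
    result = drop_outliers(df, "price", z=1.0)
    assert 1_000.0 not in result["price"].values


def test_normal_values_are_kept():
    df = pd.DataFrame({"price": [10.0, 11.0, 9.0]})
    assert len(drop_outliers(df, "price")) == 3
```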
I hope you enjoyed this video. If you did, give it a like; it helps the YouTube algorithm recommend this content to other viewers as well. I’d also like to hear from you: do you have other tips to help your fellow data gunslingers out? Some pandas plotting prowess, sexy SQL statements, or nifty notebook noodles? Share them in the comments. Now, I did talk about Jupyter Notebooks a bit in this video and said that you should move your code outside of your notebook at some point, but there are other issues with Jupyter Notebooks that you need to be aware of as well. To find out what those are and how you can address them, watch this video next. Thanks for watching, and see you soon.
Great video, and really great advice! I'll be giving a course about software development next semester, and I think some of the points you talked about are worth mentioning!
6:50: to query a 'json file' or a collection of 'json files', there is the lib tinydb
Official request for a full Taipy video 🙌
For Tip 5, nbdev from fastai is a great package for exporting cells from a Jupyter Notebook to a script. From my notes, ymmv:
At the top of the notebook:
    #| default_exp folder_name_if_desired.file_name
Per cell to be exported:
    #| export
To export, add the below to a cell (using the current directory as the path):
    import nbdev
    nbdev.export.nb_export("Notebook Name.ipynb", "./")
I constantly use the Parquet data format. It makes loading data WAY faster. In Python, it works just like CSV (e.g., using Pandas, instead of read_csv() you use read_parquet()). It is bundled with an intelligent way of compressing repeated values, so it has a much smaller disk footprint compared to CSV and JSON. It stores data in a columnar fashion, so if you only need some columns for one project and other columns for another, you can avoid retrieving unwanted columns into memory. It also works well with Big Data environments (such as Apache Spark).
Having a smaller disk footprint also means you can transfer it to other people more easily, and store it in cloud solutions at a lower cost.
And honestly, as a Data Scientist, you kind of never open the CSV or JSON file and check it yourself. 99% of the time we use a library like Pandas or software like Tableau to visualize and work with the data. So being human-readable is not really an advantage for data scientists the way it is for backend and frontend developers.
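(Editor's note: a minimal sketch of the pattern described above; it requires pyarrow or fastparquet alongside pandas, and the file and column names are just examples.)

```python
import pandas as pd

df = pd.read_csv("big_dataset.csv")   # example input
df.to_parquet("big_dataset.parquet")  # much smaller on disk than CSV

# Later: load only the columns a given project needs.
subset = pd.read_parquet("big_dataset.parquet", columns=["user_id", "amount"])
```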
Thank you! Could you please return to LLMs for a short series on MemGPT, OS, and function calls (YT, v=rxjsbUiuOFo, robot-to-robot interaction)? If time permits, it would be great to see a demo and the thought process: how futuristic is the scenario, and will it be a cost-effective consideration, on-prem vs. cloud platform? Thank you
When using external libraries, you have to develop some heuristic of what makes a good, trustworthy package. Especially in the DS space, there is so much incredibly mangled code, often with insufficient tests. Relying on these is a risk; the question is often whether you yourself are more likely to write decent code.
By the time you said what I can do with Taipy, I already was not interested. Perhaps tell people what Taipy can do for them first, and then how to install it and shit…
Polars scales great. Read the CSV and query, lazily if needed. Parquet for intermediate file-system storage, polars.write_database if needed. "If you have to ask, Polars is enough".
Many of my colleagues write code in notebooks; when deploying their work to production, they simply copy the code into a .py file and push it to GitLab. The worst thing in a data science team is that you are supposed to produce data science results, and no one cares about your code quality, not even the team leader. The codebase quickly becomes messy and dirty, as those data scientists try various dirty ways to get things working. My leader told me that my code quality is the best on our team, but that it isn't necessary; we are supposed to create good machine learning features and models. Now, whenever I join a new data science team, the first thing I do is share a link to ArjanCodes with all my colleagues. Let's learn coding from design patterns!!
Can someone outline for me what benefits notebooks have over IDE development? I've recently switched from doing data science with an IDE in a typical software dev environment to using Databricks notebooks (due to a job change). I honestly can't see any benefit, but I can see a lot of drawbacks. In an IDE like PyCharm I can rapidly create experiments, I can visualise data AND I can write clean, safe software. Notebooks put so many obstacles in the way of good development. What am I missing?
As always, start from scratch. It's from zero, nothing, etc.
Arjan can now fill the Dutch city of Tilburg with his 200k subs! Impressive, since he only passed the city of Breda (150k) a couple of months ago!
Congratulations Arjan!
Great video. I feel data-science projects are rarely examined in terms of design/structure quality. I hope to see more videos about it in the future. Perhaps on writing tests, I sometimes lack ideas about how to test data-science code.
This is the content I didn't know I needed. Pure gold.
This was a really great video. I have been a data scientist for over 2 years, and it was great to see that I had already developed a habit of using some of these points (probably because of your other videos 😂❤), but I learned something new too!
Could you do the same thing for data engineering?? That would be awesome!
Tip Number 0: Don't use Notebooks
Useful list of tips but I have additional tips we can derive by combining these tips.
tip 1 + tip 6: Use a common way for externalizing configurations
If each project externalizes configurations differently, for example, one uses a YAML file and another uses a .env file, it will be a nightmare for other people, particularly for engineers working on the deployment and the operation.
Regarding Jupyter notebooks, a lot of things still work when opening the notebook in VS Code, such as code formatters. You just may need to trigger them specifically in each cell (Alt+Shift+F for the Black formatter, I believe).