7 Tips To Structure Your Python Data Science Projects
- November 4, 2023
- Posted by: MainInstructor
- Category: BASIC, Data Science, Go, Python, SQL
Today I’ll cover 7 tips to help you better structure your data science projects. If you’re doing a project where you need to analyse some data, it often doesn’t come with a clear objective at the beginning: as you get more insights from your data, you’re going to add or remove features. So you might have a tendency to think of your code as throwaway, and that you don’t need to spend any time designing the software, setting up the project properly, or writing clean code. But that would be a mistake, because the whole idea of properly setting up your project and thinking about software design is that it allows you to make changes more quickly and more easily. And especially with data science projects, you need to change your code regularly, and you need to be able to do it quickly.
So after watching this video, you have no more excuses.

The first tip is to use a common structure for your various projects. You’re probably going to be involved in several projects that deal with data, and you probably also need to switch between them regularly, so it really helps to follow the same project structure everywhere to minimise your cost of switching context. That goes double if you’re part of a team and want to share code with a colleague: if you can somehow agree among the team that you’re going to use the same kind of structure everywhere, it’s just going to make life so much easier. I’ve found it’s really helpful to spend a bit of time with your team members to make sure you agree on what a standard data science project should look like. And if you can’t agree, fight to the death; it’s the only reasonable option. A very useful tool for this is cookiecutter, which lets you start a project from a specific template, so you always have the exact same starting point. You can use an existing template like cookiecutter-data-science, or you can create your own that follows the standards of your team.
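As a rough illustration (not from the video): cookiecutter is normally driven from the command line, but it also ships a Python API. A minimal sketch, assuming the cookiecutter package is installed; the template URL and the project_name field come from the cookiecutter-data-science template and may change over time:

```python
# Minimal sketch: start a project from the cookiecutter-data-science
# template (pip install cookiecutter first).
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,  # accept the template's defaults non-interactively
    extra_context={"project_name": "customer-churn-analysis"},
)
```

Equivalently, you can just run cookiecutter with that URL in a terminal and answer the prompts interactively.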
The second tip is to use existing libraries wherever possible. You may be tempted to write your own code to process, clean, and transform your data because it seems quicker, but know that the more code you write, the bigger the chance of introducing bugs. It’s really interesting to me, looking back over my software development career over the past years, that the more practical experience I gained developing software, the more I started relying on existing libraries. And that’s not just because I’m lazy (I am lazy), but also because existing libraries and packages have put real thought into how to organise everything so that the package solves as many problems as possible. That’s really nice, because it means that in the future, when you extend your project and add new features, you’re probably in a better position than if you’d developed everything yourself from scratch. Another reason to use existing packages as much as possible is that they have (probably, hopefully) been tested properly, so you spend less time testing your own code, because the existing libraries have already solved that for you. Something else I really like about using existing libraries as much as possible is that it lets me learn a lot about the domain. By using a package like pandas, for example, you learn what a data frame is, how processing typically works, and what kinds of standards you should follow, because the makers of pandas have created all sorts of things to help you with that. So for me, it’s also a learning experience. Of course, there are tons of helpful libraries: pandas, but also NumPy, scikit-learn, PyTorch, and SQLAlchemy, lots of libraries that are useful for data science projects. Use them! One particular type of tool that’s very helpful is a pipeline or workflow tool. This lets you structure your data processing as workflows, and it makes everything way more scalable.
This is where, for example, a platform like Taipy comes in (they’re also currently sponsoring this video). Taipy handles both front end and back end, it’s open source, and you can use it for free. To install it, simply type pip install taipy, or if you’re using Poetry, type poetry add taipy to add it to your pyproject.toml file. You can use Taipy in VS Code directly via the Taipy Studio extension, and it also works in Jupyter Notebooks. The nice thing about using a tool that’s dedicated to running pipelines is that it offers a lot of features that you don’t have to build yourself. For example, Taipy has scenarios, which act as a sort of registry for all your pipeline runs. Scenarios also double as a great comparison tool for what-if analysis, because they allow you to launch pipeline runs using different parameters. This helps you take a project that you set up as a pilot with just a simple machine learning model and make it available to users with a much higher-quality model very easily. On top of that, Taipy has many other features like parallelism, caching, data scoping, and pipeline versioning. Go to Taipy’s GitHub page to check it out; I’ve also put the link in the description.
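To make the idea concrete, here is a rough, hypothetical sketch of what a small Taipy scenario can look like, based on the Taipy Core API around version 3; the node, task, and function names are made up, so check Taipy’s documentation for the exact calls:

```python
# Hypothetical sketch of a Taipy scenario (Taipy Core API, ~version 3);
# names like "raw_data" and clean() are illustrative only.
import pandas as pd
import taipy as tp
from taipy import Config

def clean(raw_df: pd.DataFrame) -> pd.DataFrame:
    return raw_df.dropna()  # stand-in for a real processing step

# Declare the data nodes and the task that connects them.
raw_cfg = Config.configure_data_node(id="raw_data")
clean_cfg = Config.configure_data_node(id="clean_data")
task_cfg = Config.configure_task(
    id="clean_task", function=clean, input=raw_cfg, output=clean_cfg
)
scenario_cfg = Config.configure_scenario(id="pipeline", task_configs=[task_cfg])

if __name__ == "__main__":
    tp.Core().run()  # start the Taipy Core service
    scenario = tp.create_scenario(scenario_cfg)
    scenario.raw_data.write(pd.DataFrame({"x": [1.0, None, 3.0]}))
    tp.submit(scenario)
    print(scenario.clean_data.read())
```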
Now, back to the video. The third tip is to make sure you log your results. As you’re analysing data and iterating, it’s really important that you log things, because if something goes wrong, you want to be able to see where that happened and then go back a few steps to fix the problem. If you don’t log things, that’s actually really hard to do. Tools like Taipy can help you log your pipeline runs, but you might also want to keep track of the various outputs. There are several ways to do this. You can use log files, which are just files you store locally on your machine, but they’re a pretty rudimentary solution. There are also log services like Papertrail that let you send logs over the Internet to a cloud service, though you might not want your logs stored on a server you don’t own; this is actually one of the major ways in which data leaks, so be very careful with that. If you’re doing machine learning specifically, there are also tools like Comet ML that let you keep track of your experiments and visualise the performance of your models.
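As a minimal sketch of the local log file option, Python’s standard library logging module already gets you quite far (the file name and messages here are just examples):

```python
import logging

# Write logs both to a local file and to the console; a cloud log
# service could later be attached as just another handler.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.FileHandler("pipeline.log"), logging.StreamHandler()],
)
logger = logging.getLogger(__name__)

logger.info("Loaded %d rows from the raw dataset", 10_000)
logger.warning("Dropped %d rows with missing values", 42)
```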
The fourth tip is to not be afraid to use intermediate data representations. You don’t have to do all your data processing in a single step: you can first do some preprocessing, store the result in an intermediate file or even in a database, and then take that to the next step. And not just for data science projects, by the way; this is something that has really helped me a lot in structuring and organising my code better, because it allows you to focus on one particular part of the job. As a first step, you can do the preprocessing and make sure the data is stored in a format you can work with at a later stage. By doing that, you force yourself to think about just the preprocessing part, instead of the whole processing pipeline that you’re building out step by step. The reason this works well is that different representations are optimised for different things. For example, data in CSV or JSON format is really great because it’s human-readable and lightweight, and you can send it to other people without worrying about them not being able to read it. But if you need to query the data, a CSV or JSON file is less convenient, because there’s no easy search functionality apart from basic text search. In that case, storing the data in a SQL database is a better option, though that has its own problems, because then you have to deal with the added complexity of managing a database. Data frames, in turn, are great for exploratory data analysis because they have a really extensive API, but you may run into RAM limitations from keeping everything in memory, they can be slower for specialised operations like database operations, and there’s a learning curve. Still, if you do a lot of data analysis, knowing about data frames is a really good skill to have. So pick whatever is most suitable for the job. If you need multiple formats at different steps, that’s not a problem at all; it’s better to convert the data than to try to work your code around a format that doesn’t work for you.
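Here’s a minimal sketch of that idea with pandas; the file and column names are hypothetical, and to_parquet needs pyarrow or fastparquet installed:

```python
import sqlite3

import pandas as pd

# Step 1: preprocess once and store an intermediate representation.
raw = pd.read_csv("raw_events.csv")        # hypothetical input file
clean = raw.dropna()
clean.to_parquet("clean_events.parquet")   # compact, fast to reload

# Step 2, possibly in a separate script: load the intermediate file
# and push it into SQLite when you need real query capabilities.
clean = pd.read_parquet("clean_events.parquet")
with sqlite3.connect("events.db") as conn:
    clean.to_sql("events", conn, if_exists="replace", index=False)
    sample = pd.read_sql("SELECT * FROM events LIMIT 5", conn)
```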
So here’s a question: what kinds of data formats do you use, and why? If you have any tips, share them in the comment section.

Tip number five is to move code that you’re planning to reuse into a shared package. Especially if you use Jupyter Notebooks, it’s really easy to lose the overview once your code starts getting more complex: you might accidentally break things in a notebook, and there’s no easy way to reuse code between notebooks. If you take the parts of the code that you want to reuse, put them in separate modules, and import those in your notebook, it’s actually way easier to manage. You could even create a package out of that, publish it, and then import the code that way. Working with Python code that’s not in a notebook also has other advantages, like being able to easily write unit tests for it, automatic formatting, style fixes, et cetera; all those things that you don’t have in a Jupyter Notebook. Now, I’m not an avid Jupyter Notebook user at all, so my main experience is with simply writing Python code, and I do notice that whenever I use a notebook, I always have the tendency to quickly get out of there and back into regular code, where I feel way more comfortable. But notebooks are pretty nice for exploratory analysis, so it’s good to integrate them into your workflow, in a meaningful, careful way.
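For example (a sketch; the module and function names are made up), the reusable part moves out of the notebook into a regular module:

```python
# my_project/preprocessing.py: reusable code lives in a plain module,
# where it can be unit-tested, formatted, and imported from anywhere.
import pandas as pd


def drop_outliers(df: pd.DataFrame, column: str, z: float = 3.0) -> pd.DataFrame:
    """Remove rows more than `z` standard deviations from the column mean."""
    deviation = (df[column] - df[column].mean()).abs()
    return df[deviation <= z * df[column].std()]
```

A notebook then only needs from my_project.preprocessing import drop_outliers, and several notebooks can share the same tested implementation.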
Tip number six is to move configuration settings into a separate file. You really want to keep configuration settings separate from the code; the worst thing you can do is have the settings spread out all over your codebase. And that’s really easy to do, right? You have your different modules, and in each module you just use some constants that you define all over the place. This makes it really hard to make changes later on. If you need to deploy the code and change the settings, because you want to connect to a different database or you need different paths or folder names, and the configuration settings are all over the place in your code, they’re really hard to find, and it’s going to take you a ton of time. So the best thing you can do is move all of those settings to at least one single place, preferably outside of the code, which I think is the best solution. But if you want to define things in code, make sure that happens in a single place. It’s the idea that when you organise your code, there should be a single “dirty” place where you do all the patching up and define all the specific things your code needs. If everything’s in one place, it’s easy to switch to different values, different constants, different ways of setting things up. What I typically do is store everything in environment variables. I might work with a local .env file so that I can easily define those variables whenever I’m working on my local machine. The advantage of environment variables is that they’re also well integrated with cloud tools: if, for example, you deploy a function or a Docker container to the cloud, it’s pretty easy to define a few environment variables to change the settings that your application should use.
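A minimal sketch of that setup, assuming the python-dotenv package and made-up variable names:

```python
# settings.py: a single place for configuration, read from the
# environment; python-dotenv loads a local .env file in development.
import os

from dotenv import load_dotenv

load_dotenv()  # harmless in production if no .env file exists

DATABASE_URL = os.environ["DATABASE_URL"]   # required; fails fast if missing
DATA_DIR = os.getenv("DATA_DIR", "./data")  # optional, with a default
DEBUG = os.getenv("DEBUG", "false").lower() == "true"
```

The rest of the codebase then imports from settings instead of defining constants locally, so deploying with different settings only means setting different environment variables.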
And by the way, pipeline tools like Taipy also support running your code with different configuration settings. Now, if you want to get better at reviewing your code and detecting problems faster, so that you’re able to make changes like cleaning up your config settings, check out my free code diagnosis workshop, where I teach you a three-factor framework for reviewing code efficiently while still identifying the main problems. You can sign up at arjan.codes/diagnosis; it contains a lot of useful advice and practical code examples you can apply right away to your own projects. The link is also in the description of the video.

The seventh and final tip is to actually write unit tests. If you think you don’t need unit tests because, well, you can just take a look at the charts: think again.
The problem with not writing unit tests is that you have a much higher chance of running into problems with your code later on, for example when you need to run your code on a new dataset, which happens regularly, right? That case is really problematic, because it means that when you run your code on a new set of data, at a point when you’re actually trying to focus on something else, you’re going to notice that there is a bug in your program. And that will probably also be the time when you’re on a deadline, you’re in a hurry, and you need to perform that analysis quickly. You’ve maybe also been out of the code for several weeks or even months, so it’s going to take you a lot of time to get back into it and fix the bug. If you write unit tests while you’re developing the code that they test, it’s actually much easier, because that’s the moment when you’re focusing on the code. That’s the moment when you can spend a bit of time writing unit tests and making sure everything is stable, so that in the future, if you switch out the dataset, or if you share your code with a colleague who uses it in a slightly different way, at least you have the tests in place to catch part of the issues. What you really want to avoid is that things are too dependent on you being involved every step of the way; the whole idea of writing code is that it automates things for you. But if you don’t write tests, bugs are going to pop up at the most inconvenient moment possible, and they’re going to need you to fix them. If you can do part of that work up front, it’s going to save you a lot of trouble in the future. Another reason to write tests is that even though you may think you can detect issues by just looking at the charts, some problems might be too small to show up in a chart but still affect the result, and therefore affect the decisions you’re going to take based on your analysis. So it’s always good to think of your code a bit more broadly than just what shows up on the charts, and make sure things are robust and stable, especially if you move some of your code to a separate package; that code is a great candidate to write unit tests for. You can even go full-on test-driven development and write the tests before you write the code, though for data science projects that’s perhaps taking it a few steps too far.
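As a small sketch of what this can look like with pytest, testing the hypothetical drop_outliers helper from tip five:

```python
# test_preprocessing.py: run with "pytest".
import pandas as pd

from my_project.preprocessing import drop_outliers


def test_extreme_value_is_removed():
    df = pd.DataFrame({"price": [10.0, 11.0, 9.0, 10.5, 1_000.0]})
    # With few rows, the outlier inflates the std, so use a tight z.
    result = drop_outliers(df, "price", z=1.0)
    assert 1_000.0 not in result["price"].values


def test_normal_values_are_kept():
    df = pd.DataFrame({"price": [10.0, 11.0, 9.0]})
    assert len(drop_outliers(df, "price")) == 3
```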
I hope you enjoyed this video. If you did, give it a like; it helps the YouTube algorithm recommend this content to other viewers as well. I’d also like to hear from you: do you have other tips to help your fellow data gunslingers out? Some pandas plotting prowess, sexy SQL statements, or nifty notebook noodles? Share them in the comments. Now, I did talk about Jupyter Notebooks a bit in this video and said that you should move your code outside of your notebook at some point, but there are other issues with Jupyter Notebooks that you need to be aware of as well. To find out what those are and how you can address them, watch this video next. Thanks for watching, and see you soon.
Great video, and really great advice! I'll be giving a course about software development next semester, and I think some of the points you talked about are worth mentioning!
6:50: to query a 'json file' or a collection of 'json files', there is the lib tinydb
Official request for a full Taipy video 🙌
For Tip 5, nbdev from fastai is a great package for exporting cells from a Jupyter Notebook to a script. From my notes, ymmv:
At the top of the notebook:
    #| default_exp folder_name_if_desired.file_name
Per cell to be exported:
    #| export
To export, add the below to a cell (using the current directory as the path):
    import nbdev
    nbdev.export.nb_export("Notebook Name.ipynb", "./")
I constantly use the Parquet data format. It makes loading data WAY faster. In Python, it works just like CSV (e.g., using Pandas, instead of read_csv() you use read_parquet()). It is bundled with an intelligent way of compressing repeated values, so it has a much smaller disk footprint compared to CSV and JSON. It stores data in a columnar fashion, so if you only need some columns for one project and other columns for another, you can avoid retrieving unwanted columns into memory. It also works well with Big Data environments (such as Apache Spark).
Having a smaller disk footprint also means you can transfer it to other people more easily, and store it in cloud solutions at a lower cost.
And honestly, as a Data Scientist, you kind of never open the CSV or JSON file and check it yourself. 99% of the time we use a library like Pandas or software like Tableau to visualize and work with the data. So being human-readable is not really an advantage for data scientists the way it is for backend and frontend developers.
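(Editor's note: a minimal sketch of the pattern described above; it requires pyarrow or fastparquet alongside pandas, and the file and column names are just examples.)

```python
import pandas as pd

df = pd.read_csv("big_dataset.csv")   # example input
df.to_parquet("big_dataset.parquet")  # much smaller on disk than CSV

# Later: load only the columns a given project needs.
subset = pd.read_parquet("big_dataset.parquet", columns=["user_id", "amount"])
```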
Thank you! Could you please return to LLMs for a short series on MemGPT, OS, and function calls (YT, v=rxjsbUiuOFo, robot-to-robot interaction)? If time permits, it would be great to see a demo and the thought process: how futuristic is the scenario, and will it be a cost-effective consideration, on-prem vs. cloud platform? Thank you
When using external libraries, you have to develop some heuristic of what makes a good, trustworthy package. Especially in the DS space, there is so much incredibly mangled code, often with insufficient tests. Relying on these is a risk; the question is often whether you yourself are more likely to write decent code.
By the time you said what I can do with Taipy, I already was not interested. Perhaps tell people what Taipy can do for them first, and then how to install it and shit…
Polars scales great. Read the CSV and query, lazily if needed. Parquet for intermediate file-system storage, polars.write_database if needed. "If you have to ask, Polars is enough".
Many of my colleagues write code in notebooks; when deploying their work to production, they simply copy the code into a .py file and push it to GitLab. The worst thing in a data science team is that you are supposed to produce data science results, and no one cares about your code quality, not even the team leader. The codebase quickly becomes messy and dirty, as those data scientists try various dirty ways to get things working. My leader told me that my code quality is the best on our team, but that it isn't necessary; we are supposed to create good machine learning features and models. Now, whenever I join a new data science team, the first thing I do is share a link to ArjanCodes with all my colleagues. Let's learn coding from design patterns!!
Can someone outline for me what benefits notebooks have over IDE development? I've recently switched from doing data science with an IDE in a typical software dev environment to using Databricks notebooks (due to a job change). I honestly can't see any benefit, but I can see a lot of drawbacks. In an IDE like PyCharm I can rapidly create experiments, I can visualise data AND I can write clean, safe software. Notebooks put so many obstacles in the way of good development. What am I missing?
As always, start from scratch. It's from zero, nothing, etc.
Arjan can now fill the Dutch city of Tilburg with his 200k subs! Impressive, since he only passed the city of Breda (150k) a couple of months ago!
Congratulations Arjan!
Great video. I feel data-science projects are rarely examined in terms of design/structure quality. I hope to see more videos about it in the future. Perhaps on writing tests, I sometimes lack ideas about how to test data-science code.
This is the content I didn't know I needed. Pure gold.
This was a really great video. I have been a data scientist for over 2 years, and it was great to see that I had already developed a habit of using some of these points (probably because of your other videos 😂❤), but I learned something new too!
Could you do the same thing for data engineering?? That would be awesome!
Tip Number 0: Don't use Notebooks
Useful list of tips but I have additional tips we can derive by combining these tips.
tip 1 + tip 6: Use a common way for externalizing configurations
If each project externalizes configurations differently, for example, one uses a YAML file and another uses a .env file, it will be a nightmare for other people, particularly for engineers working on the deployment and the operation.
Regarding Jupyter notebooks, a lot of things still work when opening the notebook in VS Code, such as code formatters. You just may need to trigger them specifically in each cell (Alt+Shift+F for the Black formatter, I believe).