FIDA Blog
Interview

Why good data is more important than fancy models

Patrick Dylong, Data Scientist at FIDA Software and Associate Researcher at the University of Jena, explains in an interview why high-quality data is the real foundation of successful AI projects. He talks about typical data problems in practice, the hype surrounding complex models - and what really counts if AI is to work in a company.

You often hear: 'Garbage in, garbage out'. Why are data quality and data understanding more important for the success of data science projects than the latest model?

Data quality and data understanding are the foundation of every data science project. Even the most complex algorithms deliver valid results only if the underlying data has been prepared correctly. Faulty data can lead to biased models, high error rates and, as a result, potentially wrong decisions. To avoid this, it is important to examine the quality and quantity of the available raw data early in a project, so that potential weaknesses can be identified and eliminated at an early stage. This not only makes subsequent model training easier, but in many cases is also what makes it possible to apply more complex algorithms and analysis methods in a meaningful way.
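As a rough illustration of what such an early look at raw data can involve, the following minimal sketch counts missing values, exact duplicates and implausible values. The function name `audit_records`, the field names and the sample records are hypothetical and invented for this example; they do not come from the interview.

```python
from collections import Counter


def audit_records(records, required_fields, valid_ranges):
    """Return simple data-quality counts for a list of dict records."""
    missing = Counter()
    out_of_range = Counter()
    seen = set()
    duplicates = 0

    for record in records:
        # A record counts as a duplicate if an identical record was seen before.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

        # Count required fields that are absent, None or empty strings.
        for field in required_fields:
            if record.get(field) in (None, ""):
                missing[field] += 1

        # Count numeric values outside the plausible range for their field.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if isinstance(value, (int, float)) and not low <= value <= high:
                out_of_range[field] += 1

    return {
        "rows": len(records),
        "duplicates": duplicates,
        "missing": dict(missing),
        "out_of_range": dict(out_of_range),
    }


# Hypothetical customer records, invented for illustration only.
records = [
    {"customer_id": 1, "age": 34, "city": "Jena"},
    {"customer_id": 2, "age": None, "city": "Erfurt"},
    {"customer_id": 3, "age": 212, "city": ""},
    {"customer_id": 1, "age": 34, "city": "Jena"},
]

report = audit_records(
    records,
    required_fields=["customer_id", "age", "city"],
    valid_ranges={"age": (0, 120)},
)
print(report)
```

Which ranges count as plausible is an assumption that would have to come from the people who know the data; the sketch only shows where such rules could plug in.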

Many companies invest a lot in model optimization - but neglect feature engineering. Why do you think this step is so crucial?

Feature engineering translates raw data into more meaningful, machine-readable features. In many contexts, the existing raw data cannot be used directly, or at least not optimally, for model training, for example because it is not available in the correct format. In addition, domain knowledge often reveals opportunities to combine or otherwise transform raw data so that it better fits the substantive and technical requirements of the use case. Hidden correlations in the data can, for example, be made visible in advance and then used directly for model training. Feature engineering thus supports model training by deliberately enriching the existing data foundation.
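A minimal sketch of the two kinds of transformation described here, converting a field into a usable format and combining fields into a more informative one, might look as follows. The record layout (`order_date`, `quantity`, `total_eur`) and the function `engineer_features` are assumptions made up for this illustration.

```python
from datetime import date


def engineer_features(raw):
    """Turn one raw order record (all values are strings) into numeric features."""
    order_date = date.fromisoformat(raw["order_date"])  # fix the format: string -> date
    quantity = int(raw["quantity"])
    total = float(raw["total_eur"])
    return {
        "weekday": order_date.weekday(),  # 0 = Monday ... 6 = Sunday
        "month": order_date.month,  # lets a model pick up seasonal effects
        "is_weekend": int(order_date.weekday() >= 5),
        # Combine two raw fields into one that may be more informative than either.
        "unit_price": total / quantity if quantity else 0.0,
    }


# Hypothetical raw record, invented for illustration only.
raw_order = {"order_date": "2024-03-16", "quantity": "4", "total_eur": "50.0"}
features = engineer_features(raw_order)
print(features)
```

Whether a weekend flag or a unit price is actually useful depends on the use case; that choice is exactly where domain knowledge enters.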

How do projects with a good data foundation differ from those with incomplete or dirty data - also in terms of effort, stability and results?

Projects with a solid data foundation can generally start model training more quickly because time-consuming data cleansing is not needed and the team can concentrate directly on substantive analysis and modeling. The resulting models are also more stable: they behave more predictably, even when new data is added, because outliers and inconsistencies in the training data have already been eliminated. As a result, such projects often deliver more reliable results with better performance metrics. With incomplete and inconsistent data, on the other hand, the effort required for corrections increases and model performance can fluctuate more.
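One common way to find candidate outliers before training is the interquartile-range rule. The sketch below is one possible approach and not necessarily the one used in the projects discussed here; the function `flag_outliers`, the factor `k=1.5` and the sample values are assumptions for illustration.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values that lie outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily order counts, invented for illustration only.
daily_orders = [10, 12, 11, 13, 12, 11, 95, 12]
outliers = flag_outliers(daily_orders)
print(outliers)
```

A flagged value is only a candidate: whether 95 orders is a data entry error or a genuine peak is a question the rule itself cannot answer.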

What role does knowledge of the relevant specialist areas play when it comes to turning raw data into meaningful information?

Domain knowledge is very important for deriving thematically relevant features from existing raw data and for interpreting results correctly. It provides the necessary context: for example, it explains industry-specific relationships, typical outliers in the data and any seasonal effects. Without this knowledge, there is a risk of mistaking statistical artifacts (i.e. purely random variations in the data) for real patterns. With it, data scientists can formulate data-driven hypotheses more precisely and check in a targeted manner whether a model's behavior makes sense in substance, especially in the increasingly important context of explainable AI.

If you had to give advice to a company with limited resources: should it invest in modeling or data maintenance first - and why?

In most cases, I would invest in data maintenance first. Even simpler models only provide reliable predictions if the underlying data is correct, complete and available in sufficient quantity. Investments in data cleansing, consistency checks and documentation help with subsequent model training and analyses: they reduce sources of error, lower maintenance costs in the long term and thus strengthen confidence in the external validity of the model, i.e. its performance outside of model training. Only with a good data foundation is it worthwhile to invest in more complex modeling approaches, which can then further increase the efficiency and performance of the basic models.
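Performance outside of model training is usually estimated by holding back data that the model never sees while it is being fitted. The following minimal sketch does this with a deliberately simple model, a least-squares line. All numbers and the helper names `fit_line` and `mean_abs_error` are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute difference between predictions and true values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical measurements, invented for illustration only.
train_x, train_y = [1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]
test_x, test_y = [5, 6], [11.0, 13.1]  # held back, never used for fitting

slope, intercept = fit_line(train_x, train_y)
train_mae = mean_abs_error(train_x, train_y, slope, intercept)
test_mae = mean_abs_error(test_x, test_y, slope, intercept)
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```

If the error on the held-back data were much larger than on the training data, that would be a reason to look at the data again before reaching for a more complex model.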

About the Author

Paul Wettstein keeps FIDA's digital marketing areas of SEO, SEA and social ads on the right track. As an enthusiastic cyclist, he combines endurance, strategy and an eye for detail, qualities that set him apart both on the road and in the digital world.
