Interview

Why good data is more important than fancy models

Patrick Dylong, Data Scientist at FIDA Software and Associate Researcher at the University of Jena, explains in an interview why high-quality data is the real foundation of successful AI projects. He talks about typical data problems in practice, the hype surrounding complex models - and what really counts if AI is to work in a company.

You often hear: 'Garbage in, garbage out'. Why are data quality and data understanding more important for the success of data science projects than the latest model?

Data quality and understanding are the foundation of every data science project. Even the most complex algorithms only deliver valid results if the underlying data is prepared correctly. Incorrect data can lead to distorted models, high error rates and therefore potentially wrong decisions. To avoid this, it is important to examine the quality and quantity of the available raw data early on in a project. That way, potential weaknesses can be identified and eliminated at an early stage, which not only makes subsequent model training easier but in many cases also allows more complex algorithms and analysis methods to be applied in a meaningful way.
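
To make this concrete, here is a minimal sketch of such an early data check in Python with pandas; the DataFrame and its columns are invented for illustration and are not taken from an actual project.

```python
# A minimal sketch of an early data quality check with pandas;
# the DataFrame and its columns are made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "revenue": [120.0, 80.0, 80.0, np.nan, 100000.0],
    "region": ["north", "south", "south", None, "east"],
})

# Share of missing values per column
print(df.isna().mean().sort_values(ascending=False))

# Number of fully duplicated rows
print("duplicated rows:", df.duplicated().sum())

# Basic descriptive statistics to spot implausible values (e.g. extreme revenue)
print(df.describe(include="all"))
```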

Many companies invest a lot in model optimization - but neglect feature engineering. Why do you think this step is so crucial?

Feature engineering translates raw data into more meaningful, machine-readable features. In many contexts, the existing raw data cannot be used directly, or at least not optimally, for model training, for example because it is not available in the right format. With domain knowledge, there are also often opportunities to combine or otherwise transform raw data so that it better matches the subject matter and technical requirements of the respective use case. Hidden correlations in the data can be made visible in advance, for example, and then used directly for model training. Feature engineering thus supports model training by specifically enriching the existing data basis.
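
A minimal sketch of what such transformations can look like in pandas; the columns order_date, revenue and n_items are hypothetical examples, not part of the interview.

```python
# A minimal feature-engineering sketch on a made-up pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-02"]),
    "revenue": [120.0, 80.0, 200.0],
    "n_items": [3, 2, 5],
})

# Bring raw data into a machine-readable format: derive calendar features
df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.dayofweek

# Combine raw columns based on domain knowledge: revenue per item
df["revenue_per_item"] = df["revenue"] / df["n_items"]

print(df)
```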

How do projects with a good data basis differ from those with incomplete or dirty data - also in terms of effort, stability and results?

Projects with a solid data basis can generally start model training more quickly because no time-consuming data cleansing is needed and the team can concentrate directly on the actual analysis and modeling. The resulting models also tend to be more stable: they behave more predictably, even when new data is added, because outliers and inconsistencies in the training data have already been eliminated in advance. As a result, projects with a good data basis often deliver more reliable results with better metrics. With incomplete and inconsistent data, on the other hand, the effort required for corrections grows and model performance can fluctuate more.
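
As one simple illustration of flagging outliers before training, here is a sketch using the common 1.5*IQR rule; the values and the threshold are assumptions for demonstration, not a recommendation from the interview.

```python
# A minimal sketch of flagging outlier candidates before model training.
import pandas as pd

values = pd.Series([10.2, 11.0, 9.8, 10.5, 58.0, 10.1])  # made-up measurements

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Candidates to inspect (and possibly correct or remove) before training
print(values[is_outlier])
```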

What role does knowledge of the relevant specialist areas play when it comes to turning raw data into meaningful information?

Domain knowledge is essential for deriving thematically relevant features from the existing raw data and for interpreting results correctly. It provides the necessary context: it explains industry-specific correlations, typical outliers in the data and any seasonal effects, for example. Without this knowledge, there is a risk of mistaking statistical artifacts, i.e. purely random variations in the data, for real patterns. Domain knowledge also enables data scientists to formulate data-driven hypotheses more precisely and to check in a targeted way whether a model has learned plausible relationships, especially in the increasingly important context of explainable AI.
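
One common way to check whether a model's learned patterns match domain expectations is permutation importance; the following sketch uses scikit-learn on synthetic data and is purely illustrative.

```python
# A minimal sketch: compare a model's feature importances against domain
# expectations using permutation importance on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # three hypothetical features
y = 2.0 * X[:, 0] + rng.normal(size=500)    # only the first feature matters

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# A domain expert can now judge whether the most important features are
# plausible or merely statistical artifacts.
for name, score in zip(["feature_0", "feature_1", "feature_2"],
                       result.importances_mean):
    print(f"{name}: {score:.3f}")
```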

If you had to give advice to a company with limited resources: should it invest in modeling or data maintenance first - and why?

In most cases, I would invest in data maintenance first. Even simple models only deliver reliable predictions if the underlying data is correct, complete and available in sufficient quantity. Investments in data cleansing, consistency checks and documentation pay off in subsequent model training and analyses: they reduce sources of error, lower maintenance costs in the long term and thereby strengthen confidence in the external validity of the model, i.e. its performance beyond the training data. Only once a good data basis is in place is it worth investing in more complex modeling approaches, which can then further increase the efficiency and performance of the baseline models.
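
A minimal sketch of what automated consistency checks can look like as part of data maintenance; the columns and plausibility rules are invented examples.

```python
# A minimal sketch of automated consistency checks on a made-up DataFrame.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, -5, 41, 29],
    "contract_start": pd.to_datetime(["2021-06-01", "2022-03-15",
                                      "2022-03-15", "2030-01-01"]),
})

checks = {
    "customer_id is unique": df["customer_id"].is_unique,
    "age in plausible range": df["age"].between(18, 110).all(),
    "contract_start not in the future":
        (df["contract_start"] <= pd.Timestamp.today()).all(),
}

for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```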

About the Author

Paul Wettstein keeps FIDA's digital marketing areas SEO, SEA and Social Ads on the right track. As a passionate cyclist, he combines endurance, strategy and an eye for detail, qualities that distinguish him both on the road and in the digital world.