Why Data Matters: Lessons from Chasing Quality Datasets
Background
One of the most painful lessons I’ve learned during my AI journey is this:
Without good data, no model — no matter how powerful — will perform well.
Early on, I underestimated the importance of data quality. I thought that a working model architecture and some basic training examples would be enough. But in reality, "garbage in, garbage out" applies more than ever in machine learning.
The Data Problem I Faced
When I began training models for various NLP and LLM tasks, I quickly hit a wall: either I didn’t have enough data, or the data I had was too noisy or irrelevant to use.
This led to several setbacks:
- Models that overfit or failed to generalize
- Validation loss stagnation
- Poor real-world performance
What frustrated me most was that I wasn’t struggling with model training — I was struggling just to get meaningful input.
Strategies I Considered
To get around the lack of usable datasets, I explored several options:
1. Crawling Public Web Data
This seemed simple in theory: scrape content from websites using Python libraries like requests, BeautifulSoup, or Selenium. In practice:
- It was slow
- Many sites block bots
- Formatting and cleaning were time-consuming
Still, for some static content it worked better than expected, especially when I needed full pages and had specific content in mind. A minimal sketch of that approach is below.
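For reference, this is roughly what the static-page version looked like, using requests and BeautifulSoup. The URLs, the paragraph-only extraction rule, and the User-Agent string are placeholders, not the exact pipeline I ran.

```python
# Minimal static-page scraping sketch (placeholder URLs and selectors).
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "data-collection-bot (contact: you@example.com)"}  # placeholder contact


def fetch_page_text(url: str) -> str:
    """Download one page and return its visible paragraph text."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Keep only paragraph text; real cleaning usually needs more rules than this.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholders
    for url in urls:
        print(fetch_page_text(url)[:200])
        time.sleep(1)  # be polite: rate-limit requests instead of hammering the site
```

Even a sketch like this hides most of the real work: the cleaning rules depend heavily on each site's markup, which is where the time actually went.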
2. Using Public APIs
For some government or open data platforms, using APIs was an option. However:
- API limits and authentication made it hard to scale
- Schema documentation was often incomplete
- Pagination logic was more complex than expected (see the sketch after this list)
In some cases, manually downloading pages was actually faster and more reliable.
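To make the pagination point concrete, here is a rough sketch of the kind of loop it takes. The endpoint, the bearer-token header, and the page/per_page parameter names are hypothetical; every open-data platform defines its own scheme, which is exactly why this was fiddlier than it looks.

```python
# Hedged sketch of paginated API collection (endpoint and parameters are placeholders).
import requests

BASE_URL = "https://api.example.org/v1/records"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                         # placeholder credential


def fetch_all_records(per_page: int = 100, max_pages: int = 50) -> list[dict]:
    """Walk page-numbered results until the API returns an empty page."""
    records: list[dict] = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            BASE_URL,
            params={"page": page, "per_page": per_page},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=10,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:  # an empty page means we've reached the end
            break
        records.extend(batch)
    return records
```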
What I Learned
- Good data is everything. A small but clean dataset is more valuable than a massive but noisy one.
- Manual effort is sometimes worth it. Crawling and cleaning data taught me how to recognize useful patterns and discard noise.
- Tool choice matters. Understanding when to use APIs vs. when to crawl directly helped me build pipelines faster.
Realization
Machine learning is not just about models. It’s about data engineering, scraping, cleaning, and even collecting data ethically and legally. That’s what turns experiments into usable tools.
I now spend more time thinking:
- Where does my data come from?
- How reliable is it?
- Is it representative of the problem I’m trying to solve?
Next Steps
- Automate partial crawling for common patterns
- Build a data quality checker to run before training (a rough sketch is included below)
- Share some cleaned datasets as part of my portfolio
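As a starting point for that quality checker, here is a rough sketch. The specific checks (empty rows, duplicates, too-short text) and the threshold are assumptions about what a first version might include, not a finished design.

```python
# Rough sketch of a pre-training data quality check (checks and threshold are assumptions).
from collections import Counter


def check_dataset(texts: list[str], min_chars: int = 20) -> dict:
    """Return simple quality statistics for a list of text examples."""
    stripped = [t.strip() for t in texts]
    counts = Counter(stripped)
    return {
        "total": len(stripped),
        "empty": sum(1 for t in stripped if not t),
        "too_short": sum(1 for t in stripped if 0 < len(t) < min_chars),
        "duplicates": sum(c - 1 for c in counts.values() if c > 1),
    }


if __name__ == "__main__":
    sample = ["A clean example sentence.", "", "A clean example sentence.", "ok"]
    print(check_dataset(sample))
    # {'total': 4, 'empty': 1, 'too_short': 1, 'duplicates': 1}
```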