New ask Hacker News story: Ask HN: How do I improve our data infrastructure?

152 by remilouf | 93 comments on Hacker News.
I was just hired as the first permanent data scientist in a big corporation. They’ve previously relied on consultants to build the infrastructure and the data science pipelines. We’re still around 10 people in the team. The code is not pretty to look at, but this is not our biggest problem.

We inherited a weird infrastructure: a mix of files in HDF5 and Parquet format dumped in S3, read with Hive and Spark. Here are the current issues:

- The volume does not require a solution this complex (we’re talking 100 GB max, accumulated over the past 4 years).
- It’s a mess: every time we onboard a new person, we have to spend several days explaining where the data is.
- There is no simple way to explore the data.
- Data and code end up being duplicated: people working on separate projects that need the same subset each write their own transformation pipeline to get the same results.

Am I the only person here who finds this completely insane?

I was thinking about building a pipeline to dump the raw data into Postgres, and then building other pipelines to denormalize and aggregate the data for each project. The difficulty with this, as with any data science project, is finding the sweet spot between data that is fine-grained enough to compute features from, but fast enough to query to train models. I was thinking that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own features in code. As the project matures, we could tweak the pipeline to compute the features. Do you have any experience with this?

Finally, I love data science and I really don’t want to end up being the person who writes pipelines for everyone. Everyone else is a consultant, and they don’t have any incentive to care about the long-term impact of architecture choices: their management only evaluates delivery (graphs, model metrics, etc.). How do I go about raising awareness?
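
To make the "raw table plus shared derived data" idea above concrete, here is a minimal sketch. It rests on several assumptions: SQLite from the Python standard library stands in for Postgres so the example runs anywhere, and the table `raw_orders`, the view `monthly_spend`, and the sample rows are all hypothetical names and data, not anything from the actual infrastructure. The point is only that the aggregation is defined once, in the database, instead of once per project.

```python
import sqlite3

# Hypothetical raw rows. In practice these would be loaded from the
# existing HDF5/Parquet files rather than typed in by hand.
RAW_ROWS = [
    ("c1", "2021-01-03", 20.0),
    ("c1", "2021-01-17", 35.0),
    ("c2", "2021-01-05", 12.5),
    ("c2", "2021-02-09", 7.5),
]


def build_database(conn: sqlite3.Connection) -> None:
    """Load the raw table once and define one shared aggregate view."""
    conn.execute(
        "CREATE TABLE raw_orders (customer_id TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ROWS)
    # The view is the single shared definition of the transformation, so
    # projects query it instead of re-implementing the aggregation.
    conn.execute(
        """
        CREATE VIEW monthly_spend AS
        SELECT customer_id,
               substr(order_date, 1, 7) AS month,
               SUM(amount) AS total_amount,
               COUNT(*) AS n_orders
        FROM raw_orders
        GROUP BY customer_id, substr(order_date, 1, 7)
        """
    )
    conn.commit()


def monthly_spend(conn: sqlite3.Connection) -> list:
    """Return the aggregated rows every project would read from."""
    return conn.execute(
        "SELECT customer_id, month, total_amount, n_orders "
        "FROM monthly_spend ORDER BY customer_id, month"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_database(conn)
    for row in monthly_spend(conn):
        print(row)
```

With a real Postgres instance the same pattern would apply, possibly with materialized views when query speed for model training matters more than freshness. That trade-off is the "sweet spot" question raised above, and this sketch does not settle it.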
