Twitch - Pipeline Data Engineer
It’s been a while since I last wrote here, so let me share what I’m doing now.
My internship program has finished, I obtained an AWS certification, and I’m very proud of myself—it feels good to meet challenging goals. Now let’s talk about my personal project.
Recently I came up with an idea to build a streaming‑platform pipeline with generative AI to check whether streamers comply with fair‑use and copyright terms. The pipeline would collect data such as user ID, user name, stream ID, viewer count, game name, start time, “is mature” flag, and thumbnail URL. These fields are a good starting point for the analysis: they tell us who the streamer is, how many viewers they have, and what game/category they’re in. Streamers sometimes switch to an unpopular category to avoid detection; unpopular categories tend to generate fewer reports. I chose Twitch because its API is well‑organized and robust, and I’ve already learned a lot about it.
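To make the collection step concrete, here is a minimal sketch of the kind of call I have in mind, assuming a registered Twitch application; the Client-ID and app access token below are placeholders, not real credentials. It hits the Helix /streams endpoint, which returns the fields listed above:

```python
# Minimal sketch of the collection step (placeholder credentials).
import requests

CLIENT_ID = "your-client-id"      # placeholder
ACCESS_TOKEN = "your-app-token"   # placeholder

HEADERS = {
    "Client-Id": CLIENT_ID,
    "Authorization": f"Bearer {ACCESS_TOKEN}",
}

resp = requests.get(
    "https://api.twitch.tv/helix/streams",
    headers=HEADERS,
    params={"first": 100},  # Helix caps a single page at 100 items
    timeout=10,
)
resp.raise_for_status()

for stream in resp.json()["data"]:
    # Keep only the fields the pipeline cares about.
    record = {
        "user_id": stream["user_id"],
        "user_name": stream["user_name"],
        "stream_id": stream["id"],
        "viewer_count": stream["viewer_count"],
        "game_name": stream["game_name"],
        "started_at": stream["started_at"],
        "is_mature": stream.get("is_mature"),
        "thumbnail_url": stream["thumbnail_url"],
    }
    print(record)
```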
The API gave me my first big learning moment: it was my first time using cursor pagination. With the Twitch API you can request only 100 items per call, and cursors work like pages in a newspaper: you can’t jump straight to page 3 without first going through pages 1 and 2. So how can I go faster when big data means many pages? My idea was to fetch only the IDs first; once I have them, I can use faster methods such as multithreading and batch requests, because the detailed lookups no longer have to respect the cursor‑pagination limits.
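Here is a sketch of that “IDs first” pass, under the same auth assumptions as above (the `max_pages` safety cap is my own arbitrary choice). Each Helix response carries a `pagination.cursor` value; passing it back as the `after` parameter fetches the next page, which is why the pages have to be walked in order:

```python
# Sketch of the "IDs first" pass over cursor-paginated results.
import requests

def collect_stream_user_ids(headers, max_pages=50):
    user_ids = []
    cursor = None
    for _ in range(max_pages):
        params = {"first": 100}       # 100 items per call is the cap
        if cursor:
            params["after"] = cursor  # "turn to the next page"
        resp = requests.get(
            "https://api.twitch.tv/helix/streams",
            headers=headers, params=params, timeout=10,
        )
        resp.raise_for_status()
        body = resp.json()
        user_ids.extend(s["user_id"] for s in body["data"])
        cursor = body.get("pagination", {}).get("cursor")
        if not cursor:                # no cursor means we hit the last page
            break
    return user_ids
```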
Multi‑threaded requests in Python
My initial idea was to use ThreadPoolExecutor (TPE) to send requests in parallel. With TPE we can perform parallel network calls: given all the IDs in a JSON file, we can scale up and send hundreds of requests simultaneously, making the process faster and more efficient. Imagine a store with one cashier and a long line. Each customer must swipe a card and wait for the bank’s approval before paying; most of that time, the cashier just stands there. If we hire ten cashiers (spawn ten threads with ThreadPoolExecutor), they can start serving new customers while the earlier ones wait for the bank. Because the bottleneck is the bank’s response (network I/O), not the cashier’s math skills (CPU), adding cashiers drastically reduces total waiting time, up to the point where we hit the store’s rate limit.
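A minimal sketch of the “ten cashiers” idea, assuming a hypothetical `fetch_user` helper that wraps one Helix /users call and the same `HEADERS` auth dict from earlier:

```python
# Ten worker threads overlap the network waits instead of serving
# requests one at a time.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch_user(user_id, headers):
    # One Helix request per user ID (hypothetical helper).
    resp = requests.get(
        "https://api.twitch.tv/helix/users",
        headers=headers, params={"id": user_id}, timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]

def fetch_all_users(user_ids, headers):
    results = []
    with ThreadPoolExecutor(max_workers=10) as pool:  # the ten cashiers
        futures = [pool.submit(fetch_user, uid, headers) for uid in user_ids]
        for future in as_completed(futures):
            results.extend(future.result())
    return results
```

In practice the worker count has to respect Twitch’s rate limit (the store’s limit in the analogy), so ten is a deliberately conservative number here.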
To further optimize my pipeline, I’ve also explored another technique to handle large datasets efficiently.
Batch Requests to Boost Efficiency
Alongside multithreading, I’ve also implemented batch requests to optimize data retrieval from the Twitch API. Batch requests let me group multiple queries into a single HTTP request, reducing the number of network calls when fetching details for hundreds of user or stream IDs. Instead of requesting data for each ID individually, I send a ‘shopping list’ of IDs in one go, and the server returns all the data in a single response, saving time and helping me stay within rate limits.
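Here is a sketch of the batching step under the same assumptions as before. Helix accepts up to 100 `id` query parameters on /helix/users, so one call can replace a hundred single-ID lookups:

```python
# Sketch of batched lookups: chunk the ID list, one request per chunk.
import requests

def chunked(seq, size=100):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def fetch_users_in_batches(user_ids, headers):
    users = []
    for batch in chunked(user_ids, 100):
        resp = requests.get(
            "https://api.twitch.tv/helix/users",
            headers=headers,
            # requests expands a list into repeated ?id=... parameters,
            # which is the "shopping list" in one request.
            params={"id": batch},
            timeout=10,
        )
        resp.raise_for_status()
        users.extend(resp.json()["data"])
    return users
```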
This approach taught me to balance batch sizes with API constraints while maximizing efficiency. Combined with multithreading, batching makes my pipeline faster and more scalable for real-time analysis. If you’ve worked with batch processing or API optimizations, I’d love to hear your insights.