With a clear understanding of their needs, Boston University decided to commission a custom-built data crawler. They sought a robust solution that could stream data from Twitter’s API in real time and process large volumes of tweets quickly. The new solution also had to integrate seamlessly with their existing systems and scale to other APIs as research needs expanded.
The university also required the ability to filter incoming data against project-specific research criteria, so that each research project received only the data relevant to it. After a thorough evaluation of several vendors, Boston University chose RichBrains, confident in our strong expertise in digital transformation, data migration, and integration projects.
Taking up this ambitious project, RichBrains developed a scalable, distributed data crawler from scratch. The solution we proposed was built on Apache Kafka and Spark Streaming: Kafka handled real-time ingestion of the incoming tweets, while Spark Streaming processed the stream as it arrived.
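To make the pipeline concrete, the sketch below shows a minimal Spark Streaming job consuming tweet JSON from a Kafka topic and applying a simple keyword filter of the kind the research-criteria requirement calls for. The broker addresses, topic name, keyword, and output path are illustrative placeholders, not the project’s actual configuration.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TweetProcessor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetProcessor")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Consumer settings for the Kafka cluster receiving the raw tweet stream
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "tweet-processor",
      "auto.offset.reset"  -> "latest"
    )

    // Subscribe to the topic the ingestion layer publishes raw tweet JSON to
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("raw-tweets"), kafkaParams)
    )

    // Keep only tweets matching a hypothetical research keyword and persist them
    stream.map(_.value)
      .filter(_.toLowerCase.contains("public health"))
      .saveAsTextFiles("hdfs:///research/tweets/public-health")

    ssc.start()
    ssc.awaitTermination()
  }
}
```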
The system was designed to run on multiple servers to accommodate the high volume and velocity of data from Twitter’s API. Building a system capable of handling millions of tweets quickly was a challenge, but our team distributed the data processing across multiple nodes, so throughput scaled with the size of the cluster rather than being limited by any single machine.
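A common way to achieve that horizontal scaling, sketched below with assumed names and sizes, is to give the ingestion topic enough partitions that each Spark executor reads and processes its own slice of the stream; Spark’s direct Kafka stream creates one task per partition, so adding partitions and executors increases throughput.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTweetTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092")
    val admin = AdminClient.create(props)

    // 12 partitions allow up to 12 consumer tasks to read in parallel across nodes;
    // replication factor 3 keeps the raw stream available if a broker fails.
    // Both numbers are illustrative, not the project's actual sizing.
    val topic = new NewTopic("raw-tweets", 12, 3.toShort)
    admin.createTopics(Collections.singleton(topic)).all().get()

    admin.close()
  }
}
```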
The solution’s standout feature was its scalability. Initially built around Twitter’s API, the system was designed so that additional APIs could be plugged in with minimal changes, giving the university a future-proof foundation for its evolving research needs.
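One way such extensibility is commonly structured, shown here only as a hypothetical sketch rather than the project’s actual code, is to hide each upstream API behind a small common interface so that new data providers feed the same Kafka topics without changes to the downstream processing.

```scala
// Hypothetical source abstraction: each upstream API implements the same interface
// and hands raw JSON records to a shared publish callback (in production, a Kafka producer).
trait ApiSource {
  def name: String
  def stream(publish: String => Unit): Unit
}

// Placeholder Twitter source; a real implementation would hold API credentials
// and keep a long-lived connection to the streaming endpoint.
class TwitterSource extends ApiSource {
  val name = "twitter"
  def stream(publish: String => Unit): Unit =
    publish("""{"source":"twitter","text":"example tweet"}""")
}

object Crawler {
  def main(args: Array[String]): Unit = {
    // Supporting a new API later means adding one more ApiSource implementation;
    // the Kafka topics and Spark jobs downstream remain unchanged.
    val sources: Seq[ApiSource] = Seq(new TwitterSource)
    sources.foreach(_.stream(json => println(s"would publish to Kafka: $json")))
  }
}
```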