From In-House to Open Source: The Journey of PyHPCC

PyHPCC is a Python package and wrapper built around the HPCC Systems web services that facilitates communication between Python and HPCC Systems. It was originally developed as a LexisNexis Risk Solutions internal tool to automate repetitive work done on HPCC Systems. In this blog post, Amila De Silva, a distinguished member of our community, walks us through the journey of PyHPCC, from its inception to its current status as an open source tool. Let’s join Amila and celebrate this important milestone as she reflects on the challenges and successes of this amazing journey!

Backstory

PyHPCC was originally developed as a tool to streamline and optimize communication between Python and HPCC Systems. Some core functionalities that PyHPCC supports are:

  • Workunit submission through inline queries or using a Git Repository.
  • Reading contents from a logical file.
  • Uploading/downloading files to a landing zone.
  • Spraying fixed and variable-length files.
  • Making Roxie calls.

Because of these features, PyHPCC quickly became our go-to solution for automating workunit submissions to HPCC Systems. It was highly effective within our immediate team, and we soon realized its potential to benefit other teams as well. In 2022, I presented the project during the HPCC Systems Community Summit, introducing PyHPCC to our community. The response was overwhelmingly positive, with many requests to leverage PyHPCC for various use cases, particularly automation. Along with the growing interest in adopting PyHPCC across the broader HPCC Systems community, we received a considerable amount of feedback on ways to enhance the project, such as the need for detailed documentation, a robust issue-tracking system, and scalable code.

Thinking Outside the Box

Given this scenario, it seemed to me that making PyHPCC an open source project could be the perfect solution—the community could contribute and potentially create better, more innovative solutions than I could.

However, converting an internal tool into a polished, open source package isn’t a straightforward process. There are several critical aspects to consider, such as ensuring the codebase meets certain standards, writing thorough documentation, and creating a clear roadmap for future contributions. To achieve this, I needed extra help, which led me to explore the possibility of working with an intern to move the project forward.

In early 2024, I reached out to the HPCC Systems Summer Internship team to explain the value of open sourcing PyHPCC. Together, we realized that it was a win-win scenario, since the intern would also gain hands-on experience with a real-world product. Within days, the PyHPCC project was listed for students to submit their proposals, and in just one week we received about 20 proposals, with several more students inquiring about the project. After reviewing the proposals and interviewing candidates, we shortlisted the finalists and ultimately selected Rohith Podugu, who stood out for his enthusiasm, skill set, and eagerness to learn. You can learn more about Rohith’s internship project on his wiki page and blog.

The Working Team

During the summer of 2024, Rohith and I spent the first few weeks discussing our internal processes, reviewing the PyHPCC codebase, and mapping out the roadmap for open sourcing the project. This was a crucial phase, as I wanted to ensure Rohith was comfortable with the project and our goals. I assigned him a few initial tasks, such as researching linters like Black and Flake8. He surprised me by suggesting a different tool called Ruff, which is both a linter and a code formatter.

Rohith’s involvement and enthusiasm were exceptional and consistent. Recognizing his drive, I decided to step back and let him take the lead. He didn’t need constant direction, so I allowed him to drive the project, meeting with him every other day to check in, help clarify his thought process and empower him. Watching Rohith grow and excel in his technical, problem-solving, and communication skills was incredibly rewarding.

The main outcomes of my collaboration with Rohith include:

1) A new PyHPCC feature to specify compile and run options during workunit submission.

2) Revamped logical file read functionality to retrieve batches of data.

3) A restructured code repository adhering to PEP 8 conventions.

4) Usage of Poetry for dependency/environment management, Ruff as the formatter and linter, and a pre-commit hook for code formatting on commit.

5) Release automation and pull request workflow checks using GitHub Actions.

6) Improved test coverage, including coverage details displayed in the pull request.

7) Comprehensive documentation using Markdown and Sphinx, targeting not only PyHPCC users but also contributors and maintainers.
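
To picture what retrieving a logical file in batches (item 2 above) can look like, here is a minimal, generic sketch. It is not PyHPCC's actual API: `read_in_batches`, `fetch_rows`, and `fake_fetch` are hypothetical names made up for illustration, and an in-memory list stands in for a logical file on the cluster.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]

def read_in_batches(
    fetch_rows: Callable[[int, int], List[Row]],
    batch_size: int,
) -> Iterator[List[Row]]:
    """Yield batches of rows until the source is exhausted.

    fetch_rows(start, count) is a hypothetical callable that returns up to
    `count` rows beginning at offset `start`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = 0
    while True:
        batch = fetch_rows(start, batch_size)
        if not batch:
            return  # nothing left to read
        yield batch
        if len(batch) < batch_size:
            return  # a short batch means we reached the end
        start += len(batch)

# Stand-in for a logical file: seven rows held in memory (assumption).
_FAKE_FILE: List[Row] = [{"id": i} for i in range(7)]

def fake_fetch(start: int, count: int) -> List[Row]:
    """Return a slice of the in-memory stand-in."""
    return _FAKE_FILE[start:start + count]

if __name__ == "__main__":
    for batch in read_in_batches(fake_fetch, batch_size=3):
        print(len(batch), "rows, first id:", batch[0]["id"])
```

The benefit of this pattern is that a caller only ever holds one batch in memory at a time, which matters when a file is too large to load in a single request.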

The Final Push

During the final weeks of the internship, Rohith and I were invited to present the open sourcing of PyHPCC at the HPCC Systems Community Summit. We immediately agreed, as it was the perfect platform to showcase our work and further promote PyHPCC. In parallel, we collaborated closely with the HPCC Systems team to officially publish PyHPCC as an open source project under the HPCC Systems GitHub organization. You can now access the finalized repository at: https://github.com/hpcc-systems/pyhpcc/. Even after Rohith’s internship ended, he stayed in touch, and we recorded a session reflecting on our experience with open sourcing PyHPCC. Our session was presented live during the 2024 HPCC Systems Community Summit. If you missed the talk, you can still watch the recording of the session on the HPCC Systems YouTube channel.

Looking Ahead

The past summer marked a major milestone for PyHPCC as we transitioned it from an internal package into an open source project. This journey wasn’t just about sharing our work with the global community but also about collaboration, mentorship, and teamwork, which made it a rewarding experience for everyone involved. Now that PyHPCC is open source, we are excited to see how the developer community will contribute and help it grow. I would like to welcome all contributors and maintainers interested in joining the PyHPCC project. Whether you are new or experienced, your input is valuable, and we look forward to collaborating with you!


About the Authors

Amila De Silva

Senior Software Engineer, LexisNexis Risk Solutions

Amila De Silva is a Software Engineer on the Innovation Engineering team of LexisNexis Risk Solutions. She joined the company in 2018 after completing her Master’s degree in Computer Science from St. Cloud State University. Since then, Amila has been involved in projects that leverage various kinds of technology, such as automation, building scalable applications, machine learning, and web design. She constantly seeks a holistic, inclusive, and simplified approach when designing software. One of her current technical expeditions includes migrating existing internal applications to the Azure Cloud with her team. Amila is the recipient of the 2019 Risk Solutions Ada Lovelace Award for the Rising Star in Technology. When Amila is not glued to her computer screen, she spends her time reading mind-bending books, painting, and practicing yoga.

Rohith Surya Podugu

2024 HPCC Systems Intern, LexisNexis Risk Solutions

Rohith Surya Podugu is a Master’s graduate from California State University, Los Angeles. During the summer of 2024, he worked as an HPCC Systems Intern under Amila De Silva, focusing on the refactoring and release of the PyHPCC project. Rohith currently works as a full stack developer in the Global Infrastructure Team at Nomura. Prior to this, he also worked at Tata Communications and Norton Lifelock. Rohith is passionate about the latest trends in Web Development and Machine Learning. Notably, his team secured seed funding for their startup idea during his bachelor’s degree. Outside of work, Rohith enjoys reading books, lifting weights at the gym, and watching movies.