![](https://crypto4nerd.com/wp-content/uploads/2023/07/1GhI2N8KmuHMdsrrnXUi6ow.png)
Introduction
As a GSoC (Google Summer of Code) contributor at the Pharo Consortium, I have been working on an exciting project titled ‘DataFrame Improvements’. In this blog post, I will take you through the progress I made during the first coding period of this project, highlighting my journey, the challenges I faced, and what I accomplished in enhancing Pharo’s DataFrame library.
Project Overview
The aim of my project is to enhance the functionality and usability of Pharo’s DataFrame library, a powerful tool for data analysis and manipulation. During the first coding period, my focus was on implementing key features and addressing some of the existing limitations in the library.
Enhanced Data Representation 👓
To convert DataFrame objects to other formats, I implemented the #toMarkdown, #toLatex, #toHtml, and #toString methods. These methods allow users to convert DataFrame objects into various formats for easier sharing, visualization, and integration into different workflows. The #toMarkdown method enables the generation of Markdown tables, which are widely used in documentation and online platforms. The #toLatex method facilitates the creation of LaTeX tables, ideal for academic and scientific publications. The #toHtml method generates HTML tables, which are well-suited for web-based applications. Finally, the #toString method provides a human-readable representation of the DataFrame, allowing users to quickly inspect the content. Users can now effortlessly transform and present their data in multiple formats.
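As a rough illustration, each conversion is a single message send to the DataFrame. The selector names come from this post; the constructor and sample data below are assumptions for the sketch:

```smalltalk
"Build a small DataFrame and render it in several formats.
DataFrame withRows: and columnNames: are assumed here."
df := DataFrame withRows: #( #('Alice' 24) #('Bob' 30) ).
df columnNames: #('name' 'age').

df toMarkdown.   "Markdown table for READMEs and docs"
df toLatex.      "LaTeX table for papers"
df toHtml.       "HTML table for web pages"
df toString.     "human-readable plain-text preview"
```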
Improved the README of the DataFrame Repository 📑
- In the “Very Simple Example” section, I replaced low-quality screenshots with cleaner, structured Markdown tables, making the examples easier to understand and replicate. Markdown tables are also much easier to update than screenshots.
- Additionally, I replaced the outdated image of the FastTable with an updated DataInspector image, providing a clearer representation of the library’s current functionality.
- I added a more detailed description that highlights the key features and benefits of the library. This allows users to quickly understand the capabilities and potential applications of DataFrame.
- Furthermore, I created a new section titled “Documentation and Literature,” where users can find valuable resources such as the DataFrame Booklet and a research paper.
Enhanced the Sorting API 📂
- I implemented a set of methods for chained sorting in the DataFrame library, such as #sortByAll: and #sortDescendingByAll:. Users can now define a sequence of columns and their corresponding sort orders, enabling them to create complex sorting rules tailored to their specific requirements.
- I added methods to the DataFrame library to enable sorting by row names: #sortByRowNames, #sortByRowNamesUsing:, and #sortDescendingByRowNames. These give users the ability to sort DataFrame objects based on the names assigned to each row, which is particularly useful when the order of rows carries specific meaning or when organizing data by predefined categories.
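The new sorting selectors can be combined as sketched below. The exact argument shape (an array of column names) is my assumption:

```smalltalk
"Chained sorting: order by 'city' first, break ties by 'age'."
df := DataFrame withRows: #(
    #('Bob' 30 'Lyon') #('Alice' 24 'Lyon') #('Eve' 27 'Paris') ).
df columnNames: #('name' 'age' 'city').

df sortByAll: #('city' 'age').           "ascending on both columns"
df sortDescendingByAll: #('city' 'age'). "descending on both columns"
df sortByRowNames.                       "reorder rows by their names"
```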
Bug Fixing 🐛
Bug fixing is a crucial aspect of software development, as it ensures the stability and reliability of the library for users. These are some of the bugs I fixed:
- #addRow originally ignored key ordering; now users can add a dataseries to a dataframe with its keys in any order, as long as the keys match the column names.
- #removeNils and #withoutNils both did the same thing on a dataseries: returning a copy of the dataseries without nils. I changed the implementation so that #removeNils removes nils from the original dataseries in place, while #withoutNils returns a nil-free copy.
- #columns originally returned a collection of arrays; it now returns a collection of series, since a DataSeries carries more information and behaviour than an array. This change also makes the API more consistent.
- Statistical methods such as variance, quartiles, and standard deviation used to signal errors if nils were present; these methods can now handle nil values.
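A sketch of the nil-handling behaviour after these fixes; the DataSeries constructor shown is an assumption:

```smalltalk
"After the fix: #withoutNils answers a nil-free copy,
while #removeNils mutates the receiver in place."
series := DataSeries withValues: #(1 2 nil 4 nil).

series stdev.                "statistics now skip nils instead of signalling an error"
copy := series withoutNils.  "copy holds 1, 2, 4; series still contains its nils"
series removeNils.           "series itself now holds only 1, 2, 4"
```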
Miscellaneous Methods Added 💻
- #describe: statistically describes a data frame, listing out the mean, quartiles, variance, etc.
- #encodeOneHot: encodes data into one-hot vectors. It works on all kinds of data: integers, decimals, strings, Roman numerals, etc.
- #asDataFrame: converts a collection, or a collection of collections, into a dataframe.
- #removeDuplicatedRows: removes duplicate rows from a data frame, keeping only the first occurrence.
- #numericalColumns: returns only the columns of the data frame holding numerical data.
- #numericalColumnNames: returns only the names of the columns of the data frame with numerical data.
- #countNils: returns the number of nil values in a dataseries.
- #countNonNils: returns the number of non-nil values in a dataseries.
- #replaceNilsWithNextRow: replaces nils in a data frame with the values of the next non-nil row.
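A quick tour of some of these helpers. The selector spellings follow the list above; the sample data and the #column: accessor are assumptions for the sketch:

```smalltalk
df := DataFrame withRows: #( #(1 'a') #(2 'b') #(2 'b') #(nil 'c') ).
df columnNames: #('x' 'label').

df describe.                   "per-column summary statistics"
df removeDuplicatedRows.       "keeps only the first #(2 'b') row"
df numericalColumnNames.       "names of the numeric columns, here 'x'"
(df column: 'x') countNils.    "number of nil values in column x"
(df column: 'x') countNonNils. "number of non-nil values in column x"
```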
Documentation 📃
- Updated the outdated methods in the DataFrame booklet.
- Added a section on ‘Handling Nil Values’ in DataFrames and DataSeries in the DataFrame booklet.
- Added comments and runnable examples for over 150 methods in the DataFrame and DataSeries classes.
Tutorials 🎓
In addition to enhancing the DataFrame library, I also spent some time creating tutorials. These tutorials cover topics such as data manipulation, web scraping, and machine learning.
By following these tutorials, users can quickly grasp the concepts and gain hands-on experience in utilizing DataFrame effectively. The tutorials cover a wide range of scenarios and real-world use cases, enabling users to apply DataFrames to their own data analysis tasks.
Plans for the Second Coding Period
- I would like to add a JSON normalizer to the DataFrame library. JSON is tree-structured data; flattening it into a dataframe would make visualizing it much simpler.
- Bug fixing is an ongoing process, and I remain committed to actively identifying and resolving any issues that may arise.
- Add commonly used datasets to the Pharo AI Datasets library and tutorials for the same.