Graph Mining for HPC Analytics – Neo4j and Property Graphs

Hi everyone, and welcome back to another blog post. I will explain the basics of what makes a property graph and how to use a tool called Neo4j to create a database based on the property graph model.

What is a property graph? At its core it has the same definition as any graph: it is composed of vertices and edges, G = {V, E}. In property graphs the terminology for these is nodes and relationships. Along with nodes and relationships, what makes a property graph unique are what are called properties: pieces of information that can be attached to either the nodes or the relationships.

Let’s discuss how nodes are defined. Nodes represent entities. Each node can hold any amount of data as key-value pairs (its properties). Nodes can also be labeled to specify which domain of the graph they belong to (forming sub-graphs).

Relationships are similar to edges in a traditional graph, except that a property graph places more requirements on them. Every relationship must have a start node, an end node, a direction, and a type. Just as nodes have properties, relationships can also hold properties. Moreover, even though relationships are stored with a direction, they can be traversed efficiently in either direction.
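To make this concrete, here is a minimal sketch in plain Python of the model described above. It uses ordinary dictionaries rather than any graph library, and the labels, relationship type, and property names (Person, Movie, WATCHED, and so on) are made up purely for illustration.

```python
# A tiny property graph modeled with plain Python dictionaries.
# Labels, types, and property names here are illustrative examples.

nodes = {
    1: {"labels": ["Person"], "properties": {"name": "Ada", "age": 36}},
    2: {"labels": ["Movie"], "properties": {"title": "Metropolis", "year": 1927}},
}

relationships = [
    {
        "start": 1,                   # start node (required)
        "end": 2,                     # end node (required)
        "type": "WATCHED",            # type (required)
        "properties": {"rating": 5},  # optional key-value data on the relationship
    },
]

# Even though each relationship is stored with a direction,
# nothing stops us from traversing it from either endpoint:
for rel in relationships:
    print(nodes[rel["start"]]["properties"]["name"],
          rel["type"],
          nodes[rel["end"]]["properties"]["title"])
```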

Where does Neo4j come into play? Neo4j is a native graph database that implements the property graph data structure described above. It is an open-source NoSQL database that nevertheless provides the ACID-compliant backend found in many traditional databases. In the next blog we will dive into more specifics of Neo4j and how to use a declarative query language called Cypher, which is similar in many ways to SQL but optimized to work with graph databases.
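As a small preview of where we are headed, below is a rough sketch of what building and querying a graph like the one above could look like from Python with the official Neo4j driver. The connection URI and credentials are placeholders for a hypothetical local instance, and the Cypher itself will get a proper treatment in the next post.

```python
# A minimal sketch using the official Neo4j Python driver (pip install neo4j).
# The URI and credentials below are placeholders for a local instance.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two labeled nodes and a typed, directed relationship with a property.
    session.run(
        "CREATE (p:Person {name: $name})-[:WATCHED {rating: $rating}]->"
        "(m:Movie {title: $title})",
        name="Ada", rating=5, title="Metropolis",
    )
    # Match the pattern back; omitting the arrowhead in the pattern lets
    # Cypher traverse the relationship regardless of its stored direction.
    result = session.run(
        "MATCH (p:Person)-[w:WATCHED]-(m:Movie) "
        "RETURN p.name AS name, w.rating AS rating, m.title AS title"
    )
    for record in result:
        print(record["name"], record["rating"], record["title"])

driver.close()
```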

Graph Mining for HPC Analytics – Introduction

Hello, my name is Luis Bobadilla. I will be writing up my findings as I conduct the research for my master’s thesis. In this post I will lay out the background and related work that my thesis builds on.

As a graduate research assistant in the Laboratory for Knowledge Discovery in Databases (KDD Lab), a machine learning research lab, I am tackling a problem for an ongoing project that many students are working on: the HPC Analytics project.

At Kansas State University we have a high performance computing cluster called Beocat (https://beocat.ksu.edu/). Our goal is to make the cluster’s system utilization and resource allocation as efficient as possible. Beocat’s users come from disciplines across the university, from biology and statistics to a range of engineering departments. When users submit a job to Beocat, they specify the number of nodes, the number of CPUs, the amount of memory, and a time limit. This is sometimes an issue for less experienced users: underestimating or overestimating those parameters can cause a job to fail or tie up system resources it never uses.

As a team, we have tried a few approaches to predicting the memory and CPU a given job will need. These approaches are described in the following two papers:

1. http://kdd.cs.ksu.edu/Publications/Conference/andresen2018predictive.pdf
2. http://kdd.cs.ksu.edu/Publications/Conference/tanash2019improving.pdf

What I am working on specifically is adding a few more features to the dataset through role extraction. In the next post we will discuss the set-up for the graph database we use (Neo4j).