Draft Paper to be submitted to 4th International Conference on Interactive Digital Media (ICIDM) 2015:

Implementation of Graph Database for OpenCog Artificial General Intelligence Framework using Neo4j

Summary

This project will create a graph backing store for OpenCog using the Neo4j graph database. The GraphBackingStore API will extend the current BackingStore C++ API. It will accept special queries that map naturally onto Cypher queries and simple manipulations of Neo4j graph traversals. The Neo4j node-relationship structures and custom indices will be optimized for AtomSpace usage and performance. The neo4c C/C++ library will be improved to allow OpenCog C/C++ code to execute Neo4j Cypher queries over REST.
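To make the REST path concrete, here is a minimal C++ sketch of how a Cypher query could be packaged for Neo4j 2.x's transactional HTTP endpoint (POST /db/data/transaction/commit). The function name cypherPayload is an illustrative assumption, not part of the actual neo4c interface, and the HTTP transport itself is omitted.

```cpp
#include <string>

// Build the JSON request body for Neo4j 2.x's transactional Cypher
// HTTP endpoint (POST /db/data/transaction/commit). neo4c would send
// this body over REST; the HTTP transport itself is omitted here.
std::string cypherPayload(const std::string& query) {
    std::string escaped;
    for (char c : query) {
        // Minimal JSON string escaping: backslash, double quote, newline.
        if (c == '\\' || c == '"') { escaped += '\\'; escaped += c; }
        else if (c == '\n') escaped += "\\n";
        else escaped += c;
    }
    return "{\"statements\":[{\"statement\":\"" + escaped + "\"}]}";
}
```

For example, cypherPayload("MATCH (n) RETURN n") yields the body {"statements":[{"statement":"MATCH (n) RETURN n"}]}, which Neo4j answers with a JSON result set.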

See: Google Summer of Code 2015 Proposal

Student Profile

Hendy Irawan (ceefour666@gmail.com)

Graduate student in Electrical Engineering, Bandung Institute of Technology, Indonesia

Motivation

I have prior experience working with Neo4j: I used it in a social commerce app, and I also adapted the Yago2s database into Neo4j (resulting in a graph database of about 14 GiB). I believe it would be practical to use Neo4j for the AtomSpace as well.

In my experience, Neo4j has excellent traversal performance. Before 2.0, however, it was hard to get good lookup performance. Fortunately, since v2.0 Neo4j has easy-to-use indexes (based on Lucene), and I have found that by using this feature carefully, lookup operations are fast too.

I also agree with Dr. Goertzel that the schema and general structure should be adaptable to HypergraphDB.

Skills

I am proficient in C/C++, Java, HTML, JavaScript, and CSS. I have working knowledge and experience in installing and running Neo4j, designing graph databases, and executing Cypher queries. I have beginner-level knowledge of Python and Scheme, and I am prepared to learn Scheme further to complete this task.

Past Contributions and Patches

About OpenCog

OpenCog is an umbrella project for several open source projects with a vision of realizing Artificial General Intelligence (AGI) using an integrative approach: several narrow AI modules are integrated and work together to perform AGI tasks.

The core module of OpenCog is AtomSpace, used to represent various kinds of knowledge inside the OpenCog framework. Other OpenCog modules include Probabilistic Logic Networks (PLN), MOSES, and RelEx.

Goals & Benefits

The motivation for suggesting Neo4j is a combination of the following factors:

  1. Structure. It’s a graph DB so its internals match those of the AtomSpace reasonably well.
  2. Licensing. It has reasonable OSS license terms (GPLv3 and AGPLv3).
  3. Ecosystem. It has a robust software ecosystem around it (e.g. connection to web-services frameworks, plugins for spatial and temporal indexing, etc.) and a fairly large, active user community.
  4. Works well. OpenCog team members, notably Rodas and Eskender, have used it before so we have some validation that there aren’t weird, obvious gotchas in its usage.
  5. Potential customers. As a side point: a couple potential customers for OpenCog work are already using Neo4j, so using it will help with these particular business relationships.
  6. Performant. A specific analysis of the types of queries we will probably want to execute against a backing store in the near future, compared against Neo4j's querying and indexing methods, suggests that we should be able to execute these queries reasonably efficiently against Neo4j, via the appropriate techniques suggested below.
  7. Scalability. Neo4j can be run distributed across multiple machines; currently this uses a master-slave architecture, but there is momentum behind scaling it further. Scalability requirements and issues are discussed in Scaling OpenCog. See also Distributed AtomSpace Architecture.
  8. Persistence. The current in-memory AtomSpace and pattern matcher design can run with high performance, without bottlenecks, when the database fits in RAM. The goal is to provide a graph backing store that performs at least on par with, or better than, the PostgreSQL backing store when the database is sufficiently large or cannot fit in RAM.

OpenCog, in the medium-to-long term, is not going to commit to any particular backing store technology; the BackingStore API should remain storage-technology-independent. However, in the short-to-medium term, the choice of backing store technology may have some meaningful impact on development and utilization of the system; so the choice of which backing stores to utilize isn't a totally trivial one even though it's not a "permanent" one. For example, it should be possible to implement the GraphBackingStore API using HypergraphDB.

Challenges & Planned Approaches

Challenge: Sufficiently small datasets can be processed entirely and efficiently in RAM, so they would not meaningfully exercise a backing store.

Approach: We’ll use the OpenCog Bio knowledge base (dataset in a separate repo at git@gitlab.com:opencog-bio/bio-data.git), a moderately sized dataset of about 212 MiB, to test performance.

Challenge: The pattern matcher has a hefty setup overhead, and makes a number of worst-case, non-optimal assumptions about how to perform the query. In essence, it's designed to work well for complex queries, not simple ones.

Approach: We plan to use native Cypher queries, which are expected to perform better than running the OpenCog pattern matcher on top of Neo4j. Neo4j is also expected to perform well when running a complex query, or a simple or complex query returning many results at once, compared to running many queries that each return a single result.

Design and Implementation

Currently OpenCog supports a PostgreSQL backing store. It will be augmented with a graph backing store backed by Neo4j, accessible via the same API and via the extended graph API.

For comparison, the current API for accessing the (in-RAM) AtomSpace can be seen in opencog/atomspace/AtomSpace.h.
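As a hypothetical sketch of what the extended graph API could look like (the names below, such as GraphBackingStore::runCypher and CypherRow, are illustrative assumptions, not actual OpenCog signatures):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch only: these names (GraphBackingStore::runCypher,
// CypherRow) are illustrative assumptions, not actual OpenCog signatures.
struct CypherRow {
    std::vector<std::string> columns;  // one value per RETURN column
};

// The graph API extension: alongside the existing atom-level calls of
// BackingStore, expose raw Cypher execution for graph traversals.
class GraphBackingStore {
public:
    virtual ~GraphBackingStore() {}
    virtual std::vector<CypherRow> runCypher(const std::string& query) = 0;
};

// Trivial in-memory stub, useful for unit-testing callers without a
// running Neo4j server.
class StubGraphBackingStore : public GraphBackingStore {
public:
    std::vector<CypherRow> runCypher(const std::string&) override {
        CypherRow row;
        row.columns.push_back("Node[17]{name:\"Acme drug\",id:\"acme_drug\"}");
        return std::vector<CypherRow>(1, row);
    }
};
```

Keeping the Cypher execution behind a pure virtual method like this would let the Neo4j-backed implementation be swapped for another graph store (e.g. HypergraphDB) without changing callers.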

Pattern Matcher to Cypher Query Translator

As an illustration during initial discussion, suppose we have the query “find drugs whose active ingredients dock with the protein FKH1”. This would be formulated as:

(AND
  (Inheritance 
    $X 
    (ConceptNode "drug"))
  (Evaluation
    (PredicateNode "dock")
    (List
      $X
      (ProteinNode "FKH1"))))

The Neo4j backing store will need to translate such a pattern matcher query into a Cypher query as follows (try this query online at http://console.neo4j.org/r/wmrc6v):

MATCH
  (i:Inheritance) <-[:OPERAND]- (:And) -[:OPERAND]-> (e:Evaluation),
  (i) -[:SUPER]-> (:Concept {id: "drug"}),
  (i) -[:SUB]-> (x:Concept),
  (e) -[:PREDICATE]-> (:Predicate {id: "dock"}),
  (e) -[:PARAMETER {position: 0}]-> (x),
  (e) -[:PARAMETER {position: 1}]-> (:ProteinNode {id: "protein_fkh1"})
RETURN x;

Query Results

+-------------------------------------------+
| x                                         |
+-------------------------------------------+
| Node[17]{name:"Acme drug",id:"acme_drug"} |
+-------------------------------------------+
1 row

The above is only an illustration. The actual dataset that we plan to use for testing is the Bio knowledge base dataset in Scheme format, a 212 MiB database that is more representative for measuring the performance of the Neo4j-based backing store.

Recursive Unification in Neo4j

Instead of performing unification (variable grounding) once, some situations may require multiple passes to fully ground the variables, if the unification candidates themselves contain variables.

I propose using Cypher’s built-in support for multiple MATCH clauses to implement this behavior. Each MATCH clause can both reuse previous variables and define new variables (to be reused by next MATCHes, or to be returned as values).

To illustrate, the following subpattern from the naive Modus Ponens sample:

(ImplicationLink
    (VariableNode "$A")
    (VariableNode "$B"))

may translate simply to:

MATCH
  (a) <-[:WHEN]- (i:Implication) -[:THEN]-> (b)
RETURN i, a, b;

However, we want to make sure we find VariableNodes:

MATCH
  (a) <-[:WHEN]- (i:Implication) -[:THEN]-> (b)
OPTIONAL MATCH
  (a) -[*]-> (av:Variable),
  (b) -[*]-> (bv:Variable)
RETURN i, a, b, av, bv;

Those VariableNodes (av and bv) can then be used to translate the pattern to Cypher again in the next iteration. This flattens the recursive algorithm into an iterative one. For a recursion 3 levels deep, we will need to execute 3 Cypher queries.
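The iteration described above can be sketched in C++ as follows. This is a simplified illustration under stated assumptions: runQuery stands in for "translate the current pattern to Cypher and execute it", returning the VariableNodes (like av and bv above) that are still ungrounded, and the re-translation step is reduced to taking the first remaining variable as the next pattern. All names here are assumptions, not actual OpenCog code.

```cpp
#include <functional>
#include <string>
#include <vector>

// Simplified sketch of flattening recursive unification into iteration.
// runQuery stands in for "translate the current pattern to Cypher and
// execute it", returning the VariableNodes that remain ungrounded.
// For illustration, re-translation is reduced to taking the first
// remaining variable as the next pattern. Returns the number of
// Cypher round-trips executed, capped at maxDepth.
int iterativeUnify(
    std::string pattern,
    const std::function<std::vector<std::string>(const std::string&)>& runQuery,
    int maxDepth = 10)
{
    int queries = 0;
    while (queries < maxDepth) {
        std::vector<std::string> vars = runQuery(pattern);
        ++queries;
        if (vars.empty()) break;   // fully grounded: stop iterating
        pattern = vars.front();    // ground the next level in the next pass
    }
    return queries;
}
```

With a grounding that terminates after two levels of nested variables, this loop executes exactly three queries, matching the "3 levels deep, 3 Cypher queries" behavior described above.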

More information about recursive unification is available in Recursive unification using the Pattern Matcher - OpenCog Wiki.

Additional Details

To keep this Google Summer of Code proposal concise, longer and more complete details about the plan are available, and will be updated during the work, at: Neo4j Backing Store - OpenCog Wiki

Timeline and Administration

  1. Up to start of Google Summer of Code 2015:

  2. May 25-31, 2015:

    • Specify the Neo4j schema for Scheme dumps in general, and especially for OpenCog Bio knowledge base dataset
  3. June 1-7, 2015:

    • Create the Scheme dump importer (using Java/Clojure)
    • Import the OpenCog bio knowledge base dataset Scheme dump into Neo4j
  4. June 8-14, 2015:

    • Execute handcoded Cypher queries
    • Tweak indexes/schema/etc to optimize queries while retaining convenient graph schema both programmatically and ideally for human consumption as well
  5. June 15-21, 2015:

    • Implement a pattern-matcher-to-Cypher transformer
  6. June 22-28, 2015: (mid-term evaluation June 26-Jul 2)

    • Continue implementing the pattern-matcher-to-Cypher transformer
  7. June 29-July 5, 2015:

  8. July 6-July 26, 2015:

    • Implement Neo4jProxy for GraphBackingStore
    • Bug fixes
    • Performance tweaks
  9. July 27-Aug 9, 2015:

    • Test Neo4jProxy GraphBackingStore implementation using OpenCog Bio knowledge base dataset
    • Bug fixes
    • Performance tweaks
  10. Aug 10-16, 2015:

    • Buffer time in case any of the previous tasks are late or bugs are pending
    • Stretch goal: research an additional moderately sized graph dataset (100-500 MiB) for OpenCog Neo4j Backing Store testing
  11. Aug 17-23, 2015:

    • Buffer time in case any of the previous tasks are late or bugs are pending
    • Stretch goal: initial performance comparison with HypergraphDB

The current term of my master's program, including exams, ends on May 20, 2015, and the next term starts in September 2015.

I plan to allocate 30 hours per week to Google Summer of Code 2015 work during this timeline, in most cases including weekends, when I also love to do my research.

I have an ~8 Mbps internet connection whenever I'm on campus, and I can use an HSPA connection from my mobile provider elsewhere.

I have signed the OpenCog Individual Contributor License Agreement.

I will send weekly email reports to OpenCog Google group and also update the Neo4j Backing Store - OpenCog Wiki.

Future Considerations