Evaluation data

The following datasets were used to evaluate the LION LBD system, as detailed in our published work. We provide them so that our results can be replicated.

LBD systems are typically evaluated by predicting an A-B-C chain (in either open or closed discovery mode) that describes an established discovery. The prediction uses only literature published before the discovery itself (i.e., before a cut-off year). This is usually referred to as time travel evaluation. Below we provide the data necessary to conduct time travel evaluation for ten discoveries.
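
As a rough illustration of time travel evaluation in closed discovery mode, the following sketch keeps only co-occurrences first seen before a cut-off year and then looks for intermediate B concepts that link A and C. The edge tuples, the function name and the toy years are assumptions made for illustration; they are not taken from the LION LBD code or data.

```python
from collections import defaultdict


def closed_discovery(edges, a, c, cutoff_year):
    """Return intermediate concepts B that co-occur with both a and c
    in literature published before cutoff_year.

    edges: iterable of (concept_1, concept_2, first_year) tuples
           (a hypothetical in-memory layout, not the dataset format).
    """
    neighbours = defaultdict(set)
    for left, right, first_year in edges:
        # Time travel: ignore anything first seen in or after the cut-off year.
        if first_year < cutoff_year:
            neighbours[left].add(right)
            neighbours[right].add(left)
    # B concepts are the shared neighbours of A and C.
    return sorted((neighbours[a] & neighbours[c]) - {a, c})


# Toy data: identifiers and years are invented.
toy_edges = [
    ("A", "B1", 1990),
    ("B1", "C", 1995),
    ("A", "B2", 2003),
    ("B2", "C", 1998),
    ("A", "C", 2005),
]
print(closed_discovery(toy_edges, "A", "C", cutoff_year=2000))  # ['B1']
```

With a cut-off of 2000, only B1 is reachable from both A and C, because the A-B2 co-occurrence is first seen in 2003.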

We evaluate the LION LBD system on two types of discoveries. The first is a set of five landmark cancer discoveries chosen by a team of cancer researchers (see our published work for details). The second is a set of five other medical discoveries that Don R. Swanson used to evaluate the earliest literature-based discovery methods.

Each dataset consists of three CSV files (about 9.6 GB compressed, 22 GB uncompressed):

  • nodes.csv - contains all of the nodes (concepts) in the graph. Every node has a unique ID (under the OID column).
  • edges.csv - contains all edges in the graph, where each edge represents a co-occurrence of the two connected concepts. Each row contains the OIDs of the two connected nodes, the earliest year in which the two concepts co-occur, and an array of values for each metric in LION. Each array holds the metric values from the earliest year in which the co-occurrence is recorded until the time of our experiments (2017). For example, if an edge first appears in the year 2000, each of its metric arrays contains 17 elements.
  • meta.csv - this file contains meta information about the graph and its aggregation (not necessary for any evaluation).
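
When a cut-off year is applied to an edge, each of its metric arrays needs to be shortened so that it covers only the years before the cut-off. A minimal sketch follows, assuming that element i of an array corresponds to the year first_year + i. This indexing, the function name and the sample values are assumptions and should be checked against the actual files.

```python
def values_before_cutoff(first_year, values, cutoff_year):
    """Keep only metric values for years strictly before cutoff_year.

    Assumes values[i] belongs to year first_year + i (an assumption,
    not confirmed by the dataset description).
    """
    n_years = max(0, cutoff_year - first_year)
    return values[:n_years]


# Invented example: an edge first seen in 2000 with 17 yearly values.
metric = list(range(17))
print(values_before_cutoff(2000, metric, 2005))  # [0, 1, 2, 3, 4]
print(values_before_cutoff(2000, metric, 1999))  # []
```

An edge whose first year is at or after the cut-off yields an empty array, which means the edge should be treated as absent for that evaluation.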

The cancer landmark discovery set

The following datasets can be used to conduct both open and closed discovery evaluations for the five cancer landmark discovery cases:

  • NF-κB (PR:000001754), Bcl-2 (PR:000002307), Adenoma (MESH:D000236) (dataset)
  • NOTCH1 (PR:000011331), Senescence (HOC:42), C/EBPβ (PR:000005308) (dataset)
  • IL-17 (PR:000001138), p38α (PR:000003107), MKP-1 (PR:000006736) (dataset)
  • Nrf2 (PR:000011170), ROS (CHEBI:26523), Pancreatic cancer (MESH:D010190) (dataset)
  • CXCL12 (PR:000006066), Senescence (HOC:42), Thyroid cancer (MESH:D013964) (dataset)

The Swanson set

The following datasets can be used to conduct open discovery evaluations only for the five Swanson cases:

  • Migraine (MESH:D008881), Magnesium (MESH:D008274) (dataset)
  • Somatomedin C (PR:000009182), Arginine (CHEBI:29016) (dataset)
  • Alzheimer's disease (MESH:D000544), Estrogen (MESH:D004967) (dataset)
  • Alzheimer's disease (MESH:D000544), Indomethacin (MESH:D007213) (dataset)
  • Schizophrenia (MESH:D012559), Calcium Independent Phospholipase A2 (PR:000012942) (dataset)