Hierarchical Truth Discovery Data Sets


We built two datasets to evaluate the performance of truth discovery algorithms that use hierarchies. Each dataset contains claims from web pages, a hierarchy over the claimed values, and ground truths.

More details can be found in the following paper.


Source (Citation)

  • Woohwan Jung, Younghoon Kim and Kyuseok Shim. Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion, EDBT 19 [Paper] [Code]

Heritages

This dataset contains the locations of World Heritage Sites, places of special cultural or physical significance listed by the UNESCO World Heritage Centre (http://whc.unesco.org).
  • Statistics
    Number of objects 785
    Number of sources 1,577
    Number of claims 4,424
    Number of nodes in the hierarchy 1,027

    a. Claims [Download]
    obj src value
    b. Hierarchy [Download]
    ID name parentID
    c. Ground truths [Download]
    obj val

    You can download the official World Heritage List from the UNESCO World Heritage Centre website.

    Official World Heritage List

    d. Crowdsourced answers [Download]

    This file contains answers collected from 20 workers on Amazon Mechanical Turk, a commercial crowdsourcing platform.
    obj workerId value
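The column layouts listed above (obj src value for claims, ID name parentID for the hierarchy) can be loaded with a few lines of Python. The following is a minimal sketch, not the authors' code. It assumes the files are tab-separated with no header row and that a root node has an empty parentID; none of this is stated on this page, so adjust the delimiter and the root convention after inspecting the downloaded files. The sample rows and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_rows(text, n_cols, delimiter="\t"):
    """Parse delimited text and keep only rows with exactly n_cols fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(row) for row in reader if len(row) == n_cols]


def load_claims(text):
    """Group (obj, src, value) rows as {obj: {src: value}}."""
    claims = defaultdict(dict)
    for obj, src, value in read_rows(text, 3):
        claims[obj][src] = value
    return dict(claims)


def load_hierarchy(text):
    """Map node ID -> (name, parentID) from (ID, name, parentID) rows."""
    return {
        node_id: (name, parent_id)
        for node_id, name, parent_id in read_rows(text, 3)
    }


def ancestors(hierarchy, node_id):
    """Return ancestor IDs from node_id's parent up to the root."""
    chain = []
    seen = {node_id}  # guards against accidental cycles in the file
    parent = hierarchy.get(node_id, (None, None))[1]
    while parent in hierarchy and parent not in seen:
        chain.append(parent)
        seen.add(parent)
        parent = hierarchy[parent][1]
    return chain


# Hypothetical sample rows, only to illustrate the assumed layout.
sample_hierarchy = "1\tWorld\t\n2\tItaly\t1\n3\tRome\t2\n"
sample_claims = "siteA\tsrc1\t3\nsiteA\tsrc2\t2\n"

hierarchy = load_hierarchy(sample_hierarchy)
claims = load_claims(sample_claims)
print(ancestors(hierarchy, "3"))
print(claims["siteA"])
```

With the ancestor chain available, a claimed value can be compared with another value at a coarser level of the hierarchy, for example to check whether one claim generalizes another.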

BirthPlaces

This dataset contains birthplaces of 6,005 celebrities.
  • Statistics
    Number of objects 6,005
    Number of sources 7
    Number of claims 13,510
    Number of nodes in the hierarchy 4,999

    a. Claims [Download]
    obj src value
    b. Hierarchy [Download]
    ID name parentID
    c. Ground truths

    We used IMDb data as the ground truths in the paper. Unfortunately, the birthplaces of directors, actresses, and actors are no longer available from IMDb (as of 2019.1.16). In addition, we cannot redistribute the old version of the data because IMDb's policy prohibits redistribution. We therefore provide alternative links to a dataset and an API that contain people's birthplaces. We think these resources can be used to evaluate the performance of truth discovery algorithms.

    UDBMS Group - Film dataset

    THE MOVIE DB API
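Once ground truths have been assembled as obj/val pairs, an algorithm's output can be scored against them. The sketch below is an illustration only: it uses plain majority voting as a stand-in baseline (it is not the method from the paper and ignores the hierarchy) and measures exact-match accuracy. The function names and the sample data are hypothetical.

```python
from collections import Counter


def majority_vote(claims):
    """Pick the most frequently claimed value for each object.

    claims: iterable of (obj, src, value) triples.
    Ties go to the value that was seen first.
    """
    votes = {}
    for obj, _src, value in claims:
        votes.setdefault(obj, Counter())[value] += 1
    return {obj: counter.most_common(1)[0][0] for obj, counter in votes.items()}


def accuracy(predicted, truths):
    """Fraction of ground-truth objects whose predicted value matches exactly."""
    if not truths:
        return 0.0
    hits = sum(1 for obj, val in truths.items() if predicted.get(obj) == val)
    return hits / len(truths)


# Hypothetical sample data in the obj/src/value and obj/val layouts.
sample_claims = [
    ("p1", "s1", "Seoul"),
    ("p1", "s2", "Seoul"),
    ("p1", "s3", "Busan"),
    ("p2", "s1", "Paris"),
]
sample_truths = {"p1": "Seoul", "p2": "Lyon"}

predicted = majority_vote(sample_claims)
print(predicted)
print(accuracy(predicted, sample_truths))
```

Exact matching is the strictest choice. A hierarchy-aware evaluation could also give credit when the predicted value is an ancestor of the true value, which requires the hierarchy file as well.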
