This post briefly summarizes a paper I read for study and out of curiosity, titled Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task (Yu et al., EMNLP 2018).

They propose a new dataset for the Text2SQL task that addresses the shortcomings of conventional Text2SQL datasets:

  • First, it uses an SQL query split, where no SQL query is allowed to appear in more than one of the train, dev, and test sets.
  • Second, SQL hardness is divided into four levels (Easy, Medium, Hard, and Extra Hard) to cover complex SQL, rather than only the simple queries found in existing datasets such as WikiSQL.
  • Third, the databases in the dataset contain more than one table, which is meant to address generalization to new domains in Text2SQL.
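
To make the first point concrete, here is a minimal sketch of a query-based split. It is not the authors' script. The function names `normalize_sql` and `query_split`, the 80/10/10 ratios, and the crude normalization (lowercasing and collapsing whitespace) are all assumptions made for illustration. The idea is to group the (question, SQL) pairs by their SQL query and then assign whole groups to a set, so paraphrased questions that share one query never end up on both sides of the split.

```python
import random
from collections import defaultdict


def normalize_sql(sql: str) -> str:
    """Crude normalization (an assumption): lowercase and collapse whitespace."""
    return " ".join(sql.lower().split())


def query_split(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (question, sql) pairs so each distinct SQL query lands in exactly one set."""
    # Group all questions that share the same (normalized) SQL query.
    groups = defaultdict(list)
    for question, sql in examples:
        groups[normalize_sql(sql)].append((question, sql))

    # Shuffle the distinct queries, not the individual examples.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * ratios[0])
    n_dev = int(len(keys) * ratios[1])
    buckets = {
        "train": keys[:n_train],
        "dev": keys[n_train:n_train + n_dev],
        "test": keys[n_train + n_dev:],
    }
    # Expand each bucket of queries back into its (question, sql) pairs.
    return {name: [ex for k in ks for ex in groups[k]] for name, ks in buckets.items()}


if __name__ == "__main__":
    # Toy data: 10 distinct queries, each paired with two paraphrased questions.
    data = []
    for i in range(10):
        sql = f"SELECT name FROM singer WHERE age > {i}"
        data.append((f"Which singers are older than {i}?", sql))
        data.append((f"List singers whose age exceeds {i}.", sql.upper()))
    splits = query_split(data)
    print({name: len(exs) for name, exs in splits.items()})
```

Splitting at the level of distinct queries instead of individual examples is what keeps a model from scoring well simply by memorizing queries it saw during training.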

For the detailed experiments and explanations, refer to the paper, titled Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task (Yu et al., EMNLP 2018).