Mining Text Data



1. Course Information

  • Instructor: 崔万云 (Cui Wanyun)
    • cui.wanyun@sufe.edu.cn
    • Office hours: Friday 13:30–15:00; please make an appointment by email in advance
    • Office: Room 306, School of Information Management
  • TA: 闫森 (Yan Sen)

The course makes use of the Kaggle website.

2. Reference Books

3. Information Extraction

Hi Dan, we've now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00–11:30. Chris

  • Event
  • Time
  • Place

4. Information Extraction & Sentiment Analysis

First use information extraction to pull the key information out of a Taobao review and turn it into tags such as service, shipping, and so on; then use sentiment analysis to judge whether the review is saying something good or bad.
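The two-stage pipeline described above can be sketched with a toy keyword-based approach. The aspect lexicon and sentiment word lists here are hypothetical stand-ins; a real system would use trained extraction and sentiment models.

```python
# Toy sketch of the review-mining pipeline: extract aspect tags, then
# judge overall sentiment. The lexicons below are illustrative only.
ASPECT_KEYWORDS = {"service": "service", "shipping": "shipping"}
POSITIVE = {"great", "fast", "friendly"}
NEGATIVE = {"slow", "rude", "broken"}

def mine_review(text):
    """Return (aspect tags, 'good'/'bad') for a review string."""
    words = text.lower().split()
    # Stage 1: information extraction -> aspect tags
    tags = [tag for kw, tag in ASPECT_KEYWORDS.items() if kw in words]
    # Stage 2: sentiment analysis -> overall polarity
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return tags, ("good" if score > 0 else "bad")

print(mine_review("great service but slow shipping and rude support"))
# → (['service', 'shipping'], 'bad')
```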

5. Google Translate

6. Language Technology

  • mostly solved (though phishing emails are still hard to distinguish)

    • Spam detection
    • Part-of-speech (POS) tagging
    • Named entity recognition (NER): finding entities such as people, places, and organizations
  • making good progress

    • Sentiment analysis
    • Word sense disambiguation (WSD)
    • Parsing
    • Machine translation (MT)
    • Information extraction (IE)
    • Question answering (Q&A): single-turn
  • still really hard

    • Paraphrase: deciding whether two sentences mean the same thing
    • Summarization
    • Dialog: multi-turn conversation
    • Coreference resolution: working out which entity a pronoun refers to
      • Jim comforts Kevin because he is sympathetic/crying

7. Why else is natural language understanding difficult?

  • non-standard English
  • segmentation issues
  • idioms
  • neologisms
  • world knowledge (self-supervised learning)
  • tricky entity names

8. Sentence representation

  • Bag-of-words model: {Jim, comforts, Kevin, …} = {comforts, Kevin, Jim, …}; word order is discarded
  • N-gram model
    • 2-gram: Jim-comforts, comforts-Kevin, Kevin-because
    • 3-gram: Jim-comforts-Kevin, comforts-Kevin-because
  • Embedding: neural-network-based representations
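The first two representations above are easy to make concrete. A minimal sketch in plain Python, with the example sentence chosen for illustration:

```python
from collections import Counter

def bag_of_words(tokens):
    # Order is discarded: only word counts remain, so any
    # permutation of the tokens yields the same representation.
    return Counter(tokens)

def ngrams(tokens, n):
    # Adjacent n-token windows preserve local word order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent = "Jim comforts Kevin because he is crying".split()
print(bag_of_words(sent))
print(ngrams(sent, 2))  # bigrams: ('Jim', 'comforts'), ('comforts', 'Kevin'), ...
print(ngrams(sent, 3))  # trigrams
```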

9. Skepticism and Progress

  • SQuAD 1.1 Leaderboard

    Optimization: neural networks + attention (Bengio) + self-supervised learning (Google)

  • A 2020 comparison of NLP frontiers in China and abroad

10. Skills you’ll need

  • Simple linear algebra (vectors, matrices)
  • Basic probability theory
  • Python programming
  • Neural networks
  • AND PyTorch!

11. Outline

  • Part I: Neural Networks are our friends

  • Model = function + params

    y = wx + b

    w, b: params

    y: output

    x: input

    Input: fixed, comes from the data

    Parameters: need to be estimated

    ŷ: the true (observed) data that the model's output is compared against
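The "model = function + params" idea can be sketched in a few lines of plain Python (PyTorch expresses the same structure with modules and learnable parameters):

```python
# model = function + params: the function y = w*x + b is fixed,
# while w and b are the unknowns to be estimated from data.
def model(x, w, b):
    return w * x + b

# Same function, two different parameter settings -> two different models.
print(model(2.0, w=3.0, b=1.0))  # 7.0
print(model(2.0, w=0.5, b=0.0))  # 1.0
```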

12. Loss/Cost Functions are our friends

L(model) → ℝ: the model's loss on the training data.

This is equivalent to the loss of the parameters on the training data, because the parameters are the unknowns and model = function + params:

L(params) → ℝ

Feed in a model and you get a loss; feed in a set of params and you get a loss.
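L(params) → ℝ can be made concrete with a tiny mean-squared-error sketch; the synthetic dataset (generated from y = 2x + 1) and the MSE choice are illustrative assumptions:

```python
# A loss function maps a parameter setting to a real number: L(w, b) -> R.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]  # ground-truth targets from y = 2x + 1

def mse_loss(w, b):
    preds = [w * x + b for x in xs]          # the model's outputs
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

print(mse_loss(2.0, 1.0))  # 0.0  — the true parameters give zero loss
print(mse_loss(0.0, 0.0))  # 21.0 — worse parameters give a larger loss
```

Estimating the parameters then means searching for the (w, b) that minimizes this number.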

13. Few-Shot Learning

The usual machine-learning task: given a model (chosen by a human), the computer solves for the model's parameters.

The model-search task: the computer finds a suitable model, as well as that model's parameters.

neural architecture search

Into Deep Learning

Nonlinear Neural Models


Author: Shen Hao
Copyright: Unless otherwise noted, all posts on this blog are licensed under CC BY 4.0. Please credit Shen Hao when reposting!