Text Mining
1. Course Information
- Instructor: 崔万云 (Cui Wanyun)
- cui.wanyun@sufe.edu.cn
- Office hours: Friday 13:30-15:00; please make an appointment by email in advance
- Office: Room 306, School of Information Management
- TA: 闫森 (Yan Sen)
- Kiiiiii1@163.com
- Please send course assignments to the TA's email
- Coursework uses the Kaggle website
2. References
- Natural language processing
- Speech and Language Processing, Daniel Jurafsky and James H. Martin
- Oxford Deep NLP 2017 course: https://github.com/oxford-cs-deepnlp-2017/lectures
- Deep learning
- Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- https://www.deeplearningbook.org/
- PyTorch
- Official tutorials
- Chinese documentation
- Neuroscience and philosophy
- Learning How to Learn: https://www.coursera.org/learn/ruhe-xuexi/home/welcome
- The Book of Why: The New Science of Cause and Effect, Judea Pearl
- Thinking, Fast and Slow, Daniel Kahneman
3. Information Extraction
Hi Dan, we've now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30. Chris
- Event
- Time
- Place
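As a toy illustration (not the method taught in the course), hand-written regular expressions can fill these three fields for this particular email; real IE systems learn such extractors from data rather than hard-coding them:

```python
import re

email = ("Hi Dan, we've now scheduled the curriculum meeting. "
         "It will be in Gates 159 tomorrow from 10:00-11:30. Chris")

# Hand-crafted patterns, written only for this one message.
event = re.search(r"scheduled the ([\w ]+?)\.", email).group(1)
place = re.search(r"in ([A-Z][a-z]+ \d+)", email).group(1)
time  = re.search(r"\d{1,2}:\d{2}-\d{1,2}:\d{2}", email).group(0)

print(event, "|", place, "|", time)  # curriculum meeting | Gates 159 | 10:00-11:30
```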
4. Information Extraction & Sentiment Analysis
First use information extraction to pull the aspects mentioned in a Taobao review out as tags, e.g., service, shipping, ...; then use sentiment analysis to judge whether the review rates each aspect as good or bad (a toy pipeline is sketched below).
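A minimal sketch of that two-step pipeline, assuming a made-up aspect-keyword table and sentiment lexicon (neither is a resource from the course):

```python
# Step 1: extract aspect tags; step 2: judge sentiment.
# ASPECTS, POSITIVE, and NEGATIVE are illustrative toy resources.
ASPECTS = {"service": {"service", "staff"}, "shipping": {"shipping", "delivery"}}
POSITIVE = {"fast", "friendly", "great"}
NEGATIVE = {"slow", "rude", "broken"}

def analyze(review):
    words = set(review.lower().replace(",", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)  # whole-sentence polarity
    return {aspect: ("good" if score > 0 else "bad")
            for aspect, keys in ASPECTS.items() if words & keys}

print(analyze("shipping was fast and the service staff were friendly"))
# {'service': 'good', 'shipping': 'good'}
```

A real system would score the words around each aspect mention rather than applying one sentence-level polarity to every aspect.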
5. Google Translate
6. Language Technology
mostly solved (though phishing emails are still hard to tell apart)
- Spam detection
- Part-of-speech (POS) tagging
- Named entity recognition (NER): finding entities such as people, places, and organizations
making good progress
- Sentiment analysis
- Word sense disambiguation
- Parsing
- Machine translation (MT)
- Information extraction (IE)
- Question answering (QA): single-turn
still really hard
- Paraphrase: deciding whether two sentences mean the same thing
- Summarization
- Dialog: multi-turn conversation
- Coreference resolution: working out which entity a pronoun refers to
- "Jim comforts Kevin because he is sympathetic/crying": "he" is Jim if the reason is sympathy, Kevin if he is crying
7. Why else is natural language understanding difficult?
- non-standard English
- segmentation issues
- idioms
- neologisms
- world knowledge (which self-supervised learning helps capture)
- tricky entity names
8. Sentence representation
- Bag-of-words model: unordered, e.g., {Jim, comforts, Kevin, ...} = {comforts, Kevin, Jim, ...} (a sketch of this and of n-grams follows this list)
- N-gram model
- 2-gram: Jim-comforts, comforts-Kevin, Kevin-because
- 3-gram: Jim-comforts-Kevin, comforts-Kevin-because
- Embedding: representations based on neural networks
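A short, self-contained sketch of the first two representations on the running example sentence:

```python
from collections import Counter

tokens = "Jim comforts Kevin because he is sympathetic".split()

# Bag of words: order is thrown away, only counts remain.
bow = Counter(tokens)

# N-grams: windows of n consecutive tokens preserve local order.
def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

print(bow)
print(ngrams(tokens, 2))  # [('Jim', 'comforts'), ('comforts', 'Kevin'), ...]
print(ngrams(tokens, 3))  # [('Jim', 'comforts', 'Kevin'), ...]
```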
9. Skepticism and Progress
SQuAD 1.1 Leaderboard
Optimization: neural networks + attention (Bengio) + self-supervised learning (Google)
A 2020 comparison of the NLP state of the art in China and abroad
10. Skills you’ll need
- Simple linear algebra (vectors, matrices)
- Basic probability theory
- Python programming
- Neural networks
- AND PyTorch!
11. Outline
Part I - Neural Networks are our friends
Model = function + params
ŷ = wx + b
w, b: params
ŷ: output (the model's prediction)
x: input
Input: fixed, comes from the data
Parameters: need to be estimated
y: the true data that ŷ is compared against (a PyTorch sketch follows)
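In PyTorch (which the course uses), a minimal sketch of this split between the fixed function form and the parameters to be estimated might look like:

```python
import torch

w = torch.randn(1, requires_grad=True)  # parameter: to be estimated
b = torch.zeros(1, requires_grad=True)  # parameter: to be estimated

def model(x):           # the function form, fixed by us
    return w * x + b    # ŷ: the model's prediction

x = torch.tensor([1.0, 2.0, 3.0])  # input: fixed, comes from the data
print(model(x))
```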
12. Loss/Cost Functions are our friends
L(model) → ℝ: the model's loss on the training data
Equivalently, the loss of the parameters on the training data: since model = function + params and the function form is fixed, only the params are unknown
L(params) → ℝ
Feed in a model and get a loss; feed in a set of params and get a loss (a sketch follows below).
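A minimal sketch, assuming mean squared error and toy data generated by y = 2x + 1 (both are illustrative choices, not the course's):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # training inputs
y = torch.tensor([3.0, 5.0, 7.0])   # true data, here y = 2x + 1

def L(w, b):                         # L(params) -> R
    y_hat = w * x + b                # the model's predictions
    return ((y_hat - y) ** 2).mean() # mean squared error, a single real number

print(L(torch.tensor(2.0), torch.tensor(1.0)))  # tensor(0.): perfect params
print(L(torch.tensor(0.0), torch.tensor(0.0)))  # large loss: bad params
```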
13. Few-Shot Learning
The usual machine learning task: a human specifies the model, and the computer solves for its parameters
Model search task: the computer finds a suitable model as well as that model's parameters (a toy version is sketched below)
- Neural architecture search (NAS)
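A caricature of model search, assuming the candidate set is just three hidden-layer widths and the training loop is a plain Adam fit (purely illustrative, far simpler than real neural architecture search):

```python
import torch
import torch.nn as nn

x = torch.randn(100, 4)              # toy data
y = x.sum(dim=1, keepdim=True)

def train_and_score(hidden):
    # The human no longer fixes the architecture; the computer tries one.
    net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(net.parameters(), lr=0.01)
    for _ in range(200):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

best = min([2, 8, 32], key=train_and_score)  # search over candidate models
print("best hidden size:", best)
```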