Haystack
In this guide, we will see how to integrate Pinecone with the popular Haystack library for question answering.
Install Haystack
We start by installing the latest version of Haystack, which includes all the dependencies required by the PineconeDocumentStore.
Python
!pip install -U "farm-haystack>=1.3.0" pinecone-client datasets
Initialize the PineconeDocumentStore
We initialize the PineconeDocumentStore by providing our API key. Create an account to get your free API key.
Python
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key='<<YOUR_API_KEY>>',
    index='haystack-extractive-qa',
    similarity="cosine",
    embedding_dim=384
)
INFO - haystack.document_stores.pinecone - Index statistics: name: haystack-extractive-qa, embedding dimensions: 384, record count: 0
Data Preparation
Before adding data to the document store, we must download it and convert it into the Document format that Haystack uses.
We will use the SQuAD dataset available from Hugging Face Datasets.
Python
from datasets import load_dataset
# load the squad dataset
data = load_dataset("squad", split="train")
Next, we remove duplicates and unnecessary columns.
Python
# convert to a pandas dataframe
df = data.to_pandas()
# keep only the title and context columns
df = df[["title", "context"]]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset="context")
df.head()
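The `drop_duplicates(subset="context")` step keeps only the first row seen for each context passage. The same logic can be sketched without pandas, assuming the records are plain dicts (the helper name below is ours, not from the guide):

```python
# Minimal sketch of the drop_duplicates(subset="context") step,
# assuming each record is a dict with "title" and "context" keys.
def drop_duplicate_contexts(records):
    seen = set()
    unique = []
    for rec in records:
        if rec["context"] not in seen:
            seen.add(rec["context"])
            unique.append(rec)
    return unique

rows = [
    {"title": "A", "context": "passage one"},
    {"title": "A", "context": "passage one"},  # duplicate context, dropped
    {"title": "B", "context": "passage two"},
]
print(len(drop_duplicate_contexts(rows)))  # 2
```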
|    | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 5 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 10 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 15 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 20 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
We then convert these records into the Document format.
Python
from haystack import Document

docs = []
for _, d in df.iterrows():
    # create a Haystack Document object with text content and metadata
    doc = Document(
        content=d["context"],
        meta={
            "title": d["title"],
            "context": d["context"]
        }
    )
    docs.append(doc)
This Document format contains two fields: 'content' holds the text content or passage, and 'meta' can hold any other information, which we can later use to apply metadata filtering during search.
Now we upsert the documents to Pinecone.
Python
# upsert the documents to the Pinecone index
document_store.write_documents(docs)
Initialize the Retriever
The next step is to create embedding vectors from these documents. We will use Haystack's EmbeddingRetriever with a SentenceTransformers model (multi-qa-MiniLM-L6-cos-v1) designed for question answering.
Python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="multi-qa-MiniLM-L6-cos-v1",
    model_format="sentence_transformers"
)
Then we run the PineconeDocumentStore.update_embeddings method, passing the retriever as an argument. Using GPU acceleration can greatly reduce the time this step takes.
Python
document_store.update_embeddings(
    retriever,
    batch_size=16
)
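With batch_size=16, documents are processed in fixed-size chunks rather than all at once. A minimal sketch of such chunking (the helper name is ours, not part of Haystack's API):

```python
# Hypothetical chunking helper illustrating how a batch_size parameter
# splits a document list into fixed-size batches (the last batch may be smaller).
def batched(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = list(range(50))
batches = list(batched(docs, 16))
print([len(b) for b in batches])  # [16, 16, 16, 2]
```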
Check Documents and Embeddings
We can fetch documents by their document ID using the PineconeDocumentStore.get_documents_by_id method.
Python
d = document_store.get_documents_by_id(ids=['49091c797d2236e73fab510b1e9c7f6b'], return_embedding=True)[0]
From here, we can view the document content with d.content and the document's embedding vector with d.embedding.
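The document store was created with similarity="cosine" and embedding_dim=384, so queries are matched to documents by the cosine score between their vectors. A sketch of that score in pure Python, using toy 3-dimensional vectors in place of the real 384-dimensional embeddings:

```python
import math

# Cosine similarity between two equal-length vectors, as used when the
# document store is configured with similarity="cosine".
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```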
Initialize the Extractive QA Pipeline
The ExtractiveQAPipeline is built from three key components:
- a document store (PineconeDocumentStore)
- a retriever model
- a reader model
We use the deepset/electra-base-squad2 model from the Hugging Face model hub as our reader model.
Python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/electra-base-squad2',
    use_gpu=True
)
Now we can initialize the ExtractiveQAPipeline.
Python
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
Asking Questions
With our QA retrieval pipeline in place, we can start querying via pipe.run.
Python
from haystack.utils import print_answers
query = "What was Albert Einstein famous for?"
# get the answer
answer = pipe.run(
    query=query,
    params={
        "Retriever": {"top_k": 1},
    }
)
# print the answer(s)
print_answers(answer)
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.53 Batches/s]
Query: What was Albert Einstein famous for?
Answers:
[ <Answer {
'answer': 'his theories of special relativity and general relativity', 'type': 'extractive', 'score': 0.993550717830658,
'context': 'Albert Einstein is known for his theories of special relativity and general relativity. He also made important contributions to statistical mechanics,',
'offsets_in_document': [{'start': 29, 'end': 86}],
'offsets_in_context': [{'start': 29, 'end': 86}],
'document_id': '23357c05e3e46bacea556705de1ea6a5',
'meta': {
'context': 'Albert Einstein is known for his theories of special relativity and general relativity. He also made important contributions to statistical mechanics, especially his mathematical treatment of Brownian motion, his resolution of the paradox of specific heats, and his connection of fluctuations and dissipation. Despite his reservations about its interpretation, Einstein also made contributions to quantum mechanics and, indirectly, quantum field theory, primarily through his theoretical studies of the photon.', 'title': 'Modern_history'
}
}>]
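The offsets_in_context field locates the extracted answer span inside the returned context; slicing the context string at those offsets reproduces the answer exactly (values copied from the output above):

```python
# Reconstructing the extracted answer from offsets_in_context in the output above.
context = ('Albert Einstein is known for his theories of special relativity '
           'and general relativity. He also made important contributions to '
           'statistical mechanics,')
start, end = 29, 86  # offsets_in_context from the answer
print(context[start:end])  # his theories of special relativity and general relativity
```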
Python
query = "How much oil is Egypt producing in a day?"
# get the answer
answer = pipe.run(
    query=query,
    params={
        "Retriever": {"top_k": 1},
    }
)
# print the answer(s)
print_answers(answer)
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.81 Batches/s]
Query: How much oil is Egypt producing in a day?
Answers:
[ <Answer {
'answer': '691,000 bbl/d', 'type': 'extractive', 'score': 0.9999906420707703,
'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Or',
'offsets_in_document': [{'start': 20, 'end': 33}],
'offsets_in_context': [{'start': 20, 'end': 33}],
'document_id': '57ed9720050a17237e323da5e3969a9b',
'meta': {
'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.', 'title': 'Egypt'
}
}>]
Python
query = "What are the first names of the youtube founders?"
# get the answer
answer = pipe.run(
    query=query,
    params={
        "Retriever": {"top_k": 1},
    }
)
# print the answer(s)
print_answers(answer)
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.83 Batches/s]
Query: What are the first names of the youtube founders?
Answers:
[ <Answer {
'answer': 'Hurley and Chen', 'type': 'extractive', 'score': 0.9998972713947296,
'context': 'According to a story that has often been repeated in the media, Hurley and Chen developed the idea for YouTube during the early months of 2005, after ',
'offsets_in_document': [{'start': 64, 'end': 79}],
'offsets_in_context': [{'start': 64, 'end': 79}],
'document_id': 'bd1cbd61ab617d840c5f295e21e80092',
'meta': {
'context': 'According to a story that has often been repeated in the media, Hurley and Chen developed the idea for YouTube during the early months of 2005, after they had experienced difficulty sharing videos that had been shot at a dinner party at Chen\'s apartment in San Francisco. Karim did not attend the party and denied that it had occurred, but Chen commented that the idea that YouTube was founded after a dinner party "was probably very strengthened by marketing ideas around creating a story that was very digestible".', 'title': 'YouTube'
}
}>]
We can return multiple answers by setting the top_k parameter.
Python
query = "Who was the first person to step foot on the moon?"
# get the answer
answer = pipe.run(
    query=query,
    params={
        "Retriever": {"top_k": 3},
    }
)
# print the answer(s)
print_answers(answer)
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 3.88 Batches/s]
Query: Who was the first person to step foot on the moon?
Answers:
[ <Answer {
'answer': 'Armstrong', 'type': 'extractive', 'score': 0.9998227059841156,
'context': 'The trip to the Moon took just over three days. After achieving orbit, Armstrong and Aldrin transferred into the Lunar Module, named Eagle, and after ',
'offsets_in_document': [{'start': 71, 'end': 80}],
'offsets_in_context': [{'start': 71, 'end': 80}],
'document_id': 'f74e1bf667e68d72e45437a7895df921',
'meta': {
'context': 'The trip to the Moon took just over three days. After achieving orbit, Armstrong and Aldrin transferred into the Lunar Module, named Eagle, and after a landing gear inspection by Collins remaining in the Command/Service Module Columbia, began their descent. After overcoming several computer overload alarms caused by an antenna switch left in the wrong position, and a slight downrange error, Armstrong took over manual flight control at about 180 meters (590 ft), and guided the Lunar Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 (3:17:04 pm CDT). The first humans on the Moon would wait another six hours before they ventured out of their craft. At 02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the first human to set foot on the Moon.', 'title': 'Space_Race'
}
}>, <Answer {
'answer': 'Frank Borman', 'type': 'extractive', 'score': 0.7770257890224457,
'context': 'On December 21, 1968, Frank Borman, James Lovell, and William Anders became the first humans to ride the Saturn V rocket into space on Apollo 8. They ',
'offsets_in_document': [{'start': 22, 'end': 34}],
'offsets_in_context': [{'start': 22, 'end': 34}],
'document_id': '2bc046ba90d94fe201ccde9d20552200',
'meta': {
'context': "On December 21, 1968, Frank Borman, James Lovell, and William Anders became the first humans to ride the Saturn V rocket into space on Apollo 8. They also became the first to leave low-Earth orbit and go to another celestial body, and entered lunar orbit on December 24. They made ten orbits in twenty hours, and transmitted one of the most watched TV broadcasts in history, with their Christmas Eve program from lunar orbit, that concluded with a reading from the biblical Book of Genesis. Two and a half hours after the broadcast, they fired their engine to perform the first trans-Earth injection to leave lunar orbit and return to the Earth. Apollo 8 safely landed in the Pacific ocean on December 27, in NASA's first dawn splashdown and recovery.", 'title': 'Space_Race'
}
}>, <Answer {
'answer': 'Aldrin', 'type': 'extractive', 'score': 0.6680101901292801,
'context': ' were, "That\'s one small step for [a] man, one giant leap for mankind." Aldrin joined him on the surface almost 20 minutes later. Altogether, they spe',
'offsets_in_document': [{'start': 240, 'end': 246}],
'offsets_in_context': [{'start': 72, 'end': 78}],
'document_id': 'ae1c366b1eaf5fc9d32a8d81f76bd795',
'meta': {
'context': 'The first step was witnessed by at least one-fifth of the population of Earth, or about 723 million people. His first words when he stepped off the LM\'s landing footpad were, "That\'s one small step for [a] man, one giant leap for mankind." Aldrin joined him on the surface almost 20 minutes later. Altogether, they spent just under two and one-quarter hours outside their craft. The next day, they performed the first launch from another celestial body, and rendezvoused back with Columbia.', 'title': 'Space_Race'
}
}>
]
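With top_k=3 the pipeline returns several candidate answers, ordered by score. Ranking such answer dicts by score can be sketched as follows (answer texts and scores copied from the output above):

```python
# Sketch: ranking candidate answers by score, mirroring the ordering
# in the output above.
answers = [
    {"answer": "Frank Borman", "score": 0.7770257890224457},
    {"answer": "Armstrong", "score": 0.9998227059841156},
    {"answer": "Aldrin", "score": 0.6680101901292801},
]
ranked = sorted(answers, key=lambda a: a["score"], reverse=True)
print([a["answer"] for a in ranked])  # ['Armstrong', 'Frank Borman', 'Aldrin']
```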