Building a Search Engine with LlamaIndex (openai-cookbook)

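Post overview

While looking for an example to use in our in-house study group on ChatGPT use cases, I came across one that was added to openai-cookbook on June 23, 2023 (last week, as of writing). It looked fun, so I read through it and wrote up my interpretation here. It was a satisfying chance to get at least a basic understanding of what LlamaIndex, a tool I was unfamiliar with, actually is and how it is used!

Topic: With LlamaIndex, it takes very little code to read large documents, embed them, and answer a query based on similarity. It even generates and processes sub-queries for you.
Example: Using LlamaIndex to quickly extract the needed information from large financial filings and to combine insights from multiple documents, in support of a financial analyst
Original Jupyter notebook: financial_document_analysis_with_llamaindex.ipynb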

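0. Concepts

What is LlamaIndex?

A data framework for LLM applications. You can start with a few lines of code and build a retrieval-augmented generation (RAG) system in minutes. For advanced users, LlamaIndex provides a rich toolkit for data ingestion and indexing, modules for retrieval and re-ranking, and composable components for building custom query engines. Honestly, I can't quite tell what that means yet, so let's find out through the example!

Reference: https://gpt-index.readthedocs.io/en/latest/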

1. Specifying the model

Use a GPT-3.5 model (text-davinci-003) through langchain.

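# Import paths follow the library versions the notebook used (mid-2023)
from langchain import OpenAI
from llama_index import ServiceContext, set_global_service_context

# Choose the language model
llm = OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1)

# Use the model chosen above for all subsequent LlamaIndex operations
service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)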

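2. Building a VectorStoreIndex (document embedding)

Using the llama_index library, read the financial statements in a 10-K filing and vectorize them.
10-K: a report that companies whose securities are traded in the United States must file once a year with the U.S. Securities and Exchange Commission (SEC) [Namuwiki: 10-k]

Example) a 238-page PDF document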

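from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Read the document
lyft_docs = SimpleDirectoryReader(
    input_files=["../data/10k/lyft_2021.pdf"]).load_data()

# Vectorize the document.
## Behind the scenes, this automatically splits the document into chunks,
## calls an embedding model on each chunk
## (by default this should be a separate OpenAI embedding model, not the LLM chosen above),
## and indexes the resulting embedding vectors.
lyft_index = VectorStoreIndex.from_documents(lyft_docs)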

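💡 In other words, the chunking, API calls, and storage that used to be tedious manual work are handled in a single line of code. It even reads the PDF format!

Vectorizing a large document used to require long custom code and prompt engineering to work around PDF format and size limits. VectorStoreIndex takes care of all of that in one step.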
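
To make the "chunk, embed, index, retrieve" flow a bit more concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex is implemented: the class ToyVectorIndex, the helper functions, and the sample text are all made up for illustration, and the "embedding" is just a bag-of-words count vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_size: int = 8) -> List[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical stand-in for a vector index: stores (chunk, vector) pairs."""

    def __init__(self, chunks: List[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    @classmethod
    def from_text(cls, text: str, chunk_size: int = 8) -> "ToyVectorIndex":
        return cls(chunk_text(text, chunk_size))

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Return the `top_k` chunks most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query_vector, entry[1]),
                        reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Made-up sample text, three sentences of at most seven words each
document = (
    "Revenue grew strongly during the fiscal year. "
    "Drivers and riders returned to the platform. "
    "Risk factors include regulation and competition."
)
index = ToyVectorIndex.from_text(document, chunk_size=7)
top = index.retrieve("How did revenue change during the year?", top_k=1)
print(top)
```

A query engine then hands the retrieved chunks to the LLM as context. The top_k argument here plays the same role as similarity_top_k in the next section.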

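3. Simple QA

Once a query engine has been instantiated, you can call its aquery method to get a response. aquery is asynchronous, so the call has to be awaited.
Query: "What is the revenue of Lyft in 2021? Answer in millions with page reference"
The response was $3,208.3 million, with the relevant content on page 63.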

# Create a query engine that searches the given document
# (similarity_top_k=3: use the 3 most similar chunks as context)
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

# Try a query
response = await lyft_engine.aquery(
    'What is the revenue of Lyft in 2021? Answer in millions with page reference')

$3,208.3 million (page 63)
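
One practical note on the await above: top-level await works in a Jupyter notebook, but in a plain Python script the call has to run inside an event loop. Below is a minimal sketch of that pattern. StubQueryEngine is a made-up stand-in that returns a canned answer so the snippet runs without an API key. With the real engine you would await lyft_engine.aquery(...) in the same place.

```python
import asyncio


class StubQueryEngine:
    """Hypothetical stand-in for a query engine; returns a canned answer."""

    def __init__(self, answer: str):
        self._answer = answer

    async def aquery(self, question: str) -> str:
        # Yield control once, the way a real network call would
        await asyncio.sleep(0)
        return self._answer


async def main() -> str:
    engine = StubQueryEngine("$3,208.3 million (page 63)")
    return await engine.aquery(
        "What is the revenue of Lyft in 2021? Answer in millions with page reference")


# In a script, asyncio.run creates the event loop that a notebook already provides
result = asyncio.run(main())
print(result)
```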

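4. Advanced QA

If you bundle several query engine tools into a single SubQuestionQueryEngine and call aquery on it, it generates the sub-queries needed to answer the query and then aggregates their results for you.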

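from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

# Note: uber_engine is not defined in this post. It is assumed to be built
# from Uber's 2021 10-K in the same way as lyft_engine above.
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name='lyft_10k',
            description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name='uber_10k',
            description='Provides information about Uber financials for year 2021')
    ),
]

# Combine the individual query engine tools into one engine
s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools)

# Query the combined engine
response = await s_engine.aquery(
    'Compare and contrast the customer segments and geographies that grew the fastest')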

Output printed while the query was being processed

Generated 4 sub questions.

[uber_10k] Q: What customer segments grew the fastest for Uber
[uber_10k] A: in 2021? The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.

[uber_10k] Q: What geographies grew the fastest for Uber
[uber_10k] A: Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.

[lyft_10k] Q: What customer segments grew the fastest for Lyft
[lyft_10k] A: The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.

[lyft_10k] Q: What geographies grew the fastest for Lyft
[lyft_10k] A: It is not possible to answer this question with the given context information.

→ Sub-queries were generated for each combination of topic (customer segments, geographies) and document (Lyft, Uber), and each one was answered.

Checking the response

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.
The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

→ The sub-query results are grouped by document, and then the similarities and differences between the two documents are summarized.
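
The decompose, route, and combine flow seen in the output can be sketched roughly as follows. This is only an illustration of the idea, not the library's actual implementation. Tool, SubQuestion, and every function below are made-up names. The real engine asks the LLM both to write the sub-questions and to synthesize the final answer, while this sketch uses a fixed template and simple grouping. Running the sub-questions concurrently with asyncio.gather is also my own choice for the sketch.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Tuple


@dataclass
class Tool:
    """Hypothetical tool: a name plus an async function that answers a question."""
    name: str
    answer_fn: Callable[[str], Awaitable[str]]


@dataclass
class SubQuestion:
    tool_name: str
    question: str


def generate_sub_questions(aspects: List[str], tools: List[Tool]) -> List[SubQuestion]:
    # A real engine would ask the LLM to write these; here: one per (tool, aspect)
    return [SubQuestion(tool.name, f"What {aspect} grew the fastest for {tool.name}")
            for tool in tools for aspect in aspects]


async def answer_all(sub_questions: List[SubQuestion],
                     tools: List[Tool]) -> List[Tuple[SubQuestion, str]]:
    # Route each sub-question to its tool and run them concurrently
    by_name = {tool.name: tool for tool in tools}
    answers = await asyncio.gather(
        *(by_name[sq.tool_name].answer_fn(sq.question) for sq in sub_questions))
    return list(zip(sub_questions, answers))


def synthesize(qa_pairs: List[Tuple[SubQuestion, str]]) -> str:
    # A real engine would pass the pairs to the LLM; here: group answers by tool
    grouped: Dict[str, List[str]] = {}
    for sub_question, answer in qa_pairs:
        grouped.setdefault(sub_question.tool_name, []).append(answer)
    return "\n".join(f"{name}: " + " ".join(answers)
                     for name, answers in grouped.items())


def make_stub(name: str) -> Callable[[str], Awaitable[str]]:
    """Build a canned async answer function so the sketch runs offline."""
    async def answer(question: str) -> str:
        return f"[{name}] stub answer to '{question}'."
    return answer


tools = [Tool("uber_10k", make_stub("uber_10k")),
         Tool("lyft_10k", make_stub("lyft_10k"))]
sub_questions = generate_sub_questions(["customer segments", "geographies"], tools)
qa_pairs = asyncio.run(answer_all(sub_questions, tools))
summary = synthesize(qa_pairs)
print(f"Generated {len(sub_questions)} sub questions.")
print(summary)
```

With two tools and two topics this produces four sub-questions, the same shape as the "Generated 4 sub questions." log above.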

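Wrapping up

LangChain and LlamaIndex kept coming up everywhere and were unfamiliar to me, but this exercise showed me that they are surprisingly convenient tools that save me a lot of tedious work. That said, when I asked in a ChatGPT community, some people said this tool is a bit slow, so it may be hard to apply in a production service!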
