
LLM Analytics and Evaluation

Python, Claude, LangChain, Scikit-Learn, NLTK, Gensim, LLM, Transformers
AI Agent Development

Executive Summary

This project involves building a comprehensive data app for analyzing conversational data from AI assistants. The goal is to provide insights into various aspects of AI responses, including potential hallucinations, underlying topics, sentiment, temporal patterns, and message groupings. The dashboard leverages several natural language processing (NLP) and machine learning techniques to extract meaningful information from the data.

Dataset

The analysis is designed to work with JSON datasets containing conversational messages. Key fields expected in the data include:

  • content: The text of the message.
  • sender: Identifies if the message is from the 'candidate' (user) or 'bot' (AI).
  • createdAt: Timestamp of the message for time series analysis.
  • botThreadId: Identifier to group messages into conversations or threads.
  • Optional fields like intent and citations can also be utilized for richer analysis.
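In practice, a JSON export with these fields can be loaded into a pandas DataFrame with timestamps parsed up front. A minimal sketch (the sample records are illustrative, not taken from the project's data):

```python
import pandas as pd

# Illustrative in-memory records with the expected fields; in the real
# pipeline these would come from the exported JSON file, e.g.
# df = pd.read_json("conversations.json").
records = [
    {"content": "Can you review my resume?", "sender": "candidate",
     "createdAt": "2024-01-05T18:02:00Z", "botThreadId": "t1"},
    {"content": "Happy to help. Please paste it here.", "sender": "bot",
     "createdAt": "2024-01-05T18:02:05Z", "botThreadId": "t1"},
]
df = pd.DataFrame.from_records(records)

# Parse timestamps for time series analysis and sort chronologically.
df["createdAt"] = pd.to_datetime(df["createdAt"])
df = df.sort_values("createdAt").reset_index(drop=True)

# Fail fast if a required field is missing.
required = {"content", "sender", "createdAt", "botThreadId"}
missing = required - set(df.columns)
assert not missing, f"dataset is missing fields: {missing}"
```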

Models and Techniques

The project employs a variety of models and techniques for different analytical tasks:

  • Hallucination Detection: Utilizes the Patronus API to evaluate AI responses for factual accuracy and potential hallucinations.
  • Topic Modeling: Implemented using Latent Dirichlet Allocation (LDA) from libraries like Gensim and Scikit-learn to identify the main themes or topics present in the conversation data.
  • Sentiment Analysis: Performed with multiple approaches: VADER (Valence Aware Dictionary and sEntiment Reasoner) for lexicon-based analysis, and a transformer-based model (RoBERTa via the transformers library) for more nuanced sentiment detection. A custom weighted approach combining the methods is also explored.
  • Time Series Analysis: Involves analyzing message frequency over time (daily, weekly, monthly), identifying temporal patterns (hourly, daily, monthly distributions), and potentially forecasting future message volume using models like Prophet.
  • Text Embeddings & Clustering: Messages are vectorized using techniques like TF-IDF. Dimensionality reduction is applied using t-SNE, and messages are grouped into clusters using K-Means to find similar conversation segments.
  • Basic Text Analysis: Includes word frequency analysis (using NLTK) and analysis of message length distribution.

Analysis Details

The dashboard provides interactive visualizations and insights derived from the analysis:

  • Topic Modeling: Visualizations show the distribution of topics across the dataset and the key keywords associated with each topic.
  • Sentiment Analysis: Presents the overall sentiment distribution, average sentiment scores, and allows for comparison between different sentiment analysis methods.
  • Time Series: Displays trends in message volume over various time periods and highlights recurring patterns in activity.
  • Clustering: Visualizes message clusters in a 2D space and provides examples of messages within each cluster.
  • Hallucination Detection: Allows for on-demand evaluation of specific AI responses and provides hallucination scores.
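The clustering view described above can be sketched as a TF-IDF → K-Means → t-SNE pipeline; the toy messages, cluster count, and perplexity below are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

messages = [
    "how do I apply for this role",
    "what is the application deadline",
    "tell me about the company culture",
    "what benefits does the company offer",
    "can you review my resume",
    "any tips to improve my resume",
]

# Vectorize messages with TF-IDF.
X = TfidfVectorizer(stop_words="english").fit_transform(messages)

# Group similar messages with K-Means (cluster count is assumed).
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project to 2D with t-SNE for the scatter-plot view; perplexity must be
# smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=3, random_state=42,
              init="random").fit_transform(X.toarray())
```

The `coords` array gives each message a 2D position for plotting, while `labels` colors the points by cluster.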

Quantitative Insights and Findings

Based on the analysis of the conversation data, the following quantitative insights were observed:

  • A total of 3120 messages were analyzed.
  • Messages were evenly split between the candidate (user) and the bot, with 1560 messages each (50% per sender).
  • The most frequent message intents were 'candidateJobAssist' (60.3%), 'candidateAssist' (21.8%), and 'candidateCompanyAssist' (14.1%).
  • Message lengths varied significantly, with a mean length of 597.2 characters, a standard deviation of 720.5, a minimum of 2, and a maximum of 2815 characters.
  • Analysis of message timestamps revealed peaks in activity during the Evening (36.5%) and Night (25.0%) hours.
  • Sentiment analysis using VADER showed that the majority of messages were classified as positive (78.2%), followed by neutral (19.9%), and a small percentage as negative (1.9%). The mean VADER compound sentiment score was 0.593.
  • Sentiment analysis using RoBERTa showed a different distribution, with most messages classified as neutral (66.4%), followed by positive (31.7%), and negative (1.9%).
  • The custom weighted sentiment analysis also indicated a high percentage of positive messages (80.8%), neutral (17.6%), and negative (1.6%).
  • A high correlation (0.96) was observed between the VADER and custom sentiment scores, while RoBERTa scores were weakly negatively correlated with both VADER (-0.18) and custom (-0.11) scores.
  • Topic modeling identified several key topics within the conversations; the distribution and key terms for these topics were analyzed to understand prevalent discussion themes.
  • Message pairing for hallucination detection identified 1190 potential candidate-bot conversation pairs in the analyzed dataset.
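The candidate-bot pairing step can be sketched as follows. The project's exact pairing rule is not documented here, so this adjacent-message heuristic (pair each candidate message with the bot reply immediately following it in the same thread) is an assumption:

```python
import pandas as pd

# Toy thread; the real data has many threads and messages.
df = pd.DataFrame([
    {"sender": "candidate", "content": "What jobs match my profile?",
     "botThreadId": "t1", "createdAt": "2024-01-05T18:00:00Z"},
    {"sender": "bot", "content": "Here are three openings that fit.",
     "botThreadId": "t1", "createdAt": "2024-01-05T18:00:04Z"},
    {"sender": "candidate", "content": "Thanks, tell me about the first.",
     "botThreadId": "t1", "createdAt": "2024-01-05T18:00:30Z"},
    {"sender": "bot", "content": "It is a backend engineering role.",
     "botThreadId": "t1", "createdAt": "2024-01-05T18:00:35Z"},
])
df["createdAt"] = pd.to_datetime(df["createdAt"])

# Within each thread, pair every candidate message with the bot reply
# that directly follows it; each pair is one unit for hallucination
# evaluation (user input vs. AI output).
pairs = []
for _, thread in df.sort_values("createdAt").groupby("botThreadId"):
    rows = thread.to_dict("records")
    for prev, nxt in zip(rows, rows[1:]):
        if prev["sender"] == "candidate" and nxt["sender"] == "bot":
            pairs.append({"input": prev["content"],
                          "output": nxt["content"]})
```

Each resulting pair can then be submitted for evaluation, which is how a count of candidate-bot pairs like the one above would be produced.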

Conclusion

This project demonstrates the application of various NLP and machine learning techniques to gain valuable insights from AI assistant conversation data. By analyzing topics, sentiment, temporal trends, and identifying potential hallucinations, the dashboard serves as a powerful tool for monitoring and improving the performance and quality of AI assistant interactions. The use of libraries like NLTK, Scikit-learn, Gensim, Transformers, and Patronus enables a multi-faceted approach to understanding conversational dynamics.