Chapter 18
CONSTRUCTION OF THE CORRELATION FUNCTION OF THE AI WITH OTHER VARIABLES
The correlation function in artificial intelligence (AI) is used to identify relationships, similarities, or patterns between variables, data, or entities. Building it involves a series of steps that vary with the type of data (numeric, categorical, text, images, etc.) and the context in which it is applied. The steps below describe how to build a correlation function and how it relates to other AI components.
Understanding the data
- Structure: Define what data is to be correlated, for example, values in a table, embeddings in a language model, or image features.
- Data types: Identify whether the data is numeric, categorical, or complex. This determines the type of correlation to use.
Choosing the type of correlation
Depending on the data, different methods are used:
- Pearson correlation coefficient: for numerical variables with linear relationships.
- Spearman correlation coefficient: for monotonic relationships, which need not be linear.
- Kendall coefficient: to measure the agreement between rankings.
- Cross-correlation: for time series or signals.
- Cosine similarity: for text data or AI feature vectors such as embeddings.
For categorical or mixed data, measures such as mutual information can be used.
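The difference between linear and rank-based coefficients is easiest to see on data that is monotonic but not linear. The following is a minimal sketch, assuming SciPy is available; the data (`y = x ** 3`) is hypothetical and chosen only to illustrate the point.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y grows monotonically with x, but not linearly.
x = np.arange(1, 11, dtype=float)
y = x ** 3

# Each function returns the coefficient and a p-value; only the coefficient is used here.
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
print(f"Kendall:  {kendall_tau:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, the Spearman and Kendall coefficients reach 1, while the Pearson coefficient stays below 1 because the relationship is curved.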
Preprocess the data
Before calculating the correlation, make sure to:
- Normalize the data: establish consistent ranges so that large-valued variables do not dominate scale-sensitive measures.
- Handle null or outlier values.
- Encode categorical variables if necessary (for example, using one-hot encoding).
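The three preprocessing steps can be sketched with pandas as follows. The column names, the values, and the choice of median imputation and min-max scaling are assumptions made for illustration; other strategies may suit real data better.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one categorical column.
raw = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 45],
    "income": [20000, 35000, 50000, 42000, 61000],
    "region": ["north", "south", "south", "north", "east"],
})

# 1. Handle nulls: fill the missing age with the column median.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 2. Normalize numeric columns to the [0, 1] range (min-max scaling).
for col in ["age", "income"]:
    col_min, col_max = clean[col].min(), clean[col].max()
    clean[col] = (clean[col] - col_min) / (col_max - col_min)

# 3. One-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["region"], dtype=float)

print(encoded)
```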
Calculate the correlation function
The calculation depends on the method chosen:
Basic correlation (Pearson):
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
where $x_i$ and $y_i$ are the data and $\bar{x}, \bar{y}$ are the means.
Cosine similarity (for vectors):
$$\text{similarity} = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|}$$
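Both formulas translate almost directly into NumPy. The sketch below is one possible implementation, with hypothetical function names and sample values; it also illustrates that the Pearson coefficient equals the cosine similarity of the mean-centered vectors.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson coefficient, written to mirror the formula term by term."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sample values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
# Centering both vectors first turns cosine similarity into Pearson's r.
r_via_cosine = cosine_similarity(np.array(x) - np.mean(x), np.array(y) - np.mean(y))

print(f"Pearson r: {r:.4f}")
print(f"Cosine of centered vectors: {r_via_cosine:.4f}")
```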
Integrate with AI
Correlation in AI is key to:
- Predictive models: selecting relevant features.
- Recommendation systems: finding similarities between users or products.
- Natural language processing (NLP): comparing embeddings of words, phrases, or documents.
- Computer vision: relating visual features.
To calculate correlations in AI, you can use libraries like NumPy and Pandas, or machine learning frameworks like TensorFlow and PyTorch.
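As an illustration of the first use, features can be ranked by the absolute value of their correlation with the target and filtered with a cutoff. This is a minimal sketch on simulated data; the feature names, the seed, and the 0.5 threshold are assumptions, and a correlation filter of this kind only detects linear relationships.

```python
import numpy as np
import pandas as pd

# Simulated data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "useful": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 2.0 * features["useful"] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation of every feature with the target, strongest first.
correlations = features.corrwith(target).abs().sort_values(ascending=False)

# Keep the features whose correlation exceeds an arbitrary cutoff.
threshold = 0.5
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print("Selected features:", selected)
```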
Visualization and analysis
It is useful to visualize the correlation results to interpret them better:
- Heatmaps: for correlation matrices.
- Scatter plots: to analyze individual relationships.
Evaluating the usefulness of correlation
- Confirming hypotheses: does the correlation reflect a significant relationship, or is it a coincidence?
- Avoiding false causality: remember that correlation does not imply causation.
In the context of AI, correlation can be integrated with other tools such as supervised or unsupervised models to optimize learning and decision making.
Let's work through a simple practical case to calculate a correlation function and analyze it.
Case Study: Correlation between features of a dataset
Suppose we have a dataset about people that includes:
- Age.
- Hours of exercise per week.
- Average energy level (rating from 1 to 10).
We want to determine the correlation between:
- Age and hours of exercise.
- Hours of exercise and energy level.
Simulated dataset
| Age | Hours of Exercise | Energy Level |
|------|---------------------|------------------|
| 25 | 5 | 8 |
| 30 | 3 | 6 |
| 35 | 2 | 5 |
| 40 | 4 | 6 |
| 45 | 1 | 4 |
Step 1: Preprocessing
- Clean data: There are no null or outlier values in this case.
- Standardization: Not necessary, because the Pearson coefficient does not change when a variable is shifted or rescaled by a positive constant.
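The claim about standardization can be checked directly on the Age and Exercise columns of the table. The sketch below rescales one variable (years to months) and standardizes the other; the variable names are illustrative.

```python
import numpy as np

age_years = np.array([25, 30, 35, 40, 45], dtype=float)
exercise_hours = np.array([5, 3, 2, 4, 1], dtype=float)

# Correlation on the raw values.
r_original = np.corrcoef(age_years, exercise_hours)[0, 1]

# Rescale age to months and convert exercise hours to z-scores.
age_months = age_years * 12
exercise_z = (exercise_hours - exercise_hours.mean()) / exercise_hours.std()

# Correlation on the transformed values.
r_rescaled = np.corrcoef(age_months, exercise_z)[0, 1]

print(f"Original: {r_original:.4f}")
print(f"Rescaled: {r_rescaled:.4f}")
```

Both printed values coincide, which is why the preprocessing step can skip scaling when only Pearson coefficients are needed.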
Step 2: Correlation calculation
We use the Pearson coefficient to calculate the correlations.
The formula for two variables ($X$ and $Y$) is:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
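As a check on the formula, it can be worked by hand for Age ($x$) and Hours of Exercise ($y$) using the five rows of the simulated dataset. The arithmetic below uses only those table values.

```latex
\begin{align*}
\bar{x} &= \frac{25 + 30 + 35 + 40 + 45}{5} = 35, \qquad
\bar{y} = \frac{5 + 3 + 2 + 4 + 1}{5} = 3 \\
\sum (x_i - \bar{x})(y_i - \bar{y}) &= (-10)(2) + (-5)(0) + (0)(-1) + (5)(1) + (10)(-2) = -35 \\
\sum (x_i - \bar{x})^2 &= 100 + 25 + 0 + 25 + 100 = 250 \\
\sum (y_i - \bar{y})^2 &= 4 + 0 + 1 + 1 + 4 = 10 \\
r &= \frac{-35}{\sqrt{250 \cdot 10}} = \frac{-35}{50} = -0.70
\end{align*}
```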
Step 3: Implementation in Python
Let's implement this analysis.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create the dataset
data = {
    "Age": [25, 30, 35, 40, 45],
    "Exercise Hours": [5, 3, 2, 4, 1],
    "Energy Level": [8, 6, 5, 6, 4],
}
df = pd.DataFrame(data)

# Calculate the pairwise Pearson correlations (the default method of DataFrame.corr)
correlation_matrix = df.corr()

# Show the correlation matrix
print("Correlation matrix:")
print(correlation_matrix)

# Visualize the matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
```
Expected results
- Correlation matrix (rounded to two decimals):
                  Age  Exercise Hours  Energy Level
Age              1.00           -0.70         -0.85
Exercise Hours  -0.70            1.00          0.96
Energy Level    -0.85            0.96          1.00
- Interpretation:
There is a fairly strong negative correlation between age and hours of exercise (-0.70), suggesting that in this sample older people tend to exercise less.
There is a very strong positive correlation between exercise hours and energy level (0.96), indicating that more exercise is associated with more energy.
With only five observations, these values are illustrative rather than statistically robust.
- Heatmap: a visual graph that highlights these correlations.
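Whether a coefficient computed from five rows reflects a real relationship or a coincidence can be explored with a permutation test. The following is a minimal sketch that reuses the exercise and energy values from the case study and enumerates every reordering of one column; the helper name `pearson_r` is illustrative, and other significance tests could be used instead.

```python
from itertools import permutations

import numpy as np

exercise_hours = np.array([5, 3, 2, 4, 1], dtype=float)
energy_level = np.array([8, 6, 5, 6, 4], dtype=float)

def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

r_observed = pearson_r(exercise_hours, energy_level)

# Enumerate all 5! = 120 orderings of energy_level and count how often
# a shuffled pairing is at least as strongly correlated as the observed one.
count_extreme = 0
total = 0
for perm in permutations(energy_level):
    r_perm = pearson_r(exercise_hours, np.array(perm))
    if abs(r_perm) >= abs(r_observed) - 1e-12:
        count_extreme += 1
    total += 1

p_value = count_extreme / total
print(f"Observed r: {r_observed:.3f}")
print(f"Exact two-sided permutation p-value: {p_value:.3f}")
```

A small p-value means that few random pairings reproduce a correlation this strong. With so few rows the smallest attainable p-value is limited by the 120 possible orderings, which is one more reason to treat the case study as illustrative.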
Opinion on "Construction of the Correlation Function of AI with Others" by José Saul Velásquez Restrepo
The article presents an excellent practical and conceptual guide to understanding the correlation function in the context of artificial intelligence (AI). It combines theoretical explanations with a detailed practical case that facilitates its understanding and application. Here are some highlights:
Strengths of the Article
- Clear and didactic structure:
- Introduces the relevance of correlation in AI.
- Breaks down the process into logical steps: understanding data, choosing methods, preprocessing, calculation, integration, and analysis.
- Variety of methods:
- Explains different types of correlation (Pearson, Spearman, Kendall, etc.) and when to use them.
- Mentions modern techniques relevant in AI, such as cosine similarity for embeddings.
- Solid Case Study:
- Provides a realistic and accessible example with a small data set.
- Details how to implement the solution in Python, making use of standard libraries such as NumPy, Pandas and Seaborn.
- Applicability in AI:
- Shows how correlation is integrated into predictive, recommendation and natural language processing systems, demonstrating its importance for optimizing models.
- Effective visualization:
- Recommends visual tools such as heat maps and scatter plots, which are essential for interpreting correlations.
Opportunities for Improvement
- Delving into causality:
- Although it mentions that "correlation does not imply causation," it could include examples where spurious relationships confuse the interpretation of the data.
- Expand practical applications:
- Incorporating an example with categorical data or text/image embeddings would be a plus to cover more AI contexts.
- Handling large volumes of data:
- It could include how to calculate correlations on massive datasets using optimized techniques or tools like Dask or cuDF.
- Common mistakes:
- Pointing out common errors when interpreting correlations could strengthen the educational usefulness of the text.
Final Verdict
The article combines theory, practice, and applications with an AI-centric approach. It is useful for both beginners and experts looking to optimize their analytical models. With minor tweaks, it could be a go-to resource in the field of correlation and AI. Well done!
The following suggested improvements and extensions would make the correlation case study more useful:
- Handling Causality and Spurious Correlations
To avoid confusion, we could add a practical example where correlation does not imply causality. For example:
Case Example:
In one study, a positive correlation was found between:
- Ice cream consumption.
- The number of drowning incidents.
Explanation: This does not mean that eating ice cream causes drowning. The confounding variable is hot weather, which increases both phenomena.
Visualization: A heatmap or scatter plots can demonstrate this point.
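The ice cream example can be reproduced with simulated data. In the sketch below every number (temperature range, coefficients, noise levels, seed) is invented for illustration. Both variables depend only on temperature, and the partial correlation, computed by correlating the residuals left after regressing each variable on temperature, shows how the apparent relationship disappears once the confounder is controlled for.

```python
import numpy as np

# Simulated data: temperature drives both variables; they do not affect each other.
rng = np.random.default_rng(42)
n = 1000
temperature = rng.uniform(10, 35, size=n)
ice_cream = 5.0 * temperature + rng.normal(scale=10.0, size=n)
drownings = 0.3 * temperature + rng.normal(scale=1.0, size=n)

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation between the two outcomes.
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

# Partial correlation: correlate what is left after removing the temperature effect.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation:     {raw_r:.2f}")
print(f"Partial correlation: {partial_r:.2f}")
```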
- Extension to Categorical Data
Exploring categorical correlations can be very useful. For example, using “mutual information” to measure the relationship between categories.
Case Example:
A data set with:
- Gender: Male/Female.
- Preference for a musical genre: Classical, Pop, Rock.
- Average listening hours.
Practical Implementation:
- Convert categories to numerical values or apply techniques such as one-hot encoding.
- Use specific metrics such as mutual information to measure correlation.
Python: Implement with libraries like scikit-learn.
- Big Data Example
When data is massive, traditional correlation can be inefficient. This is where tools like Dask or parallelization come in.
Case Example:
- Dataset of millions of users, with information on age, income and hours on social networks.
Implementation:
- Use cuDF (for GPUs) or Dask (parallelization).
- Apply filters before calculation to reduce complexity.
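The idea behind tools of this kind can be illustrated without any of them: the Pearson coefficient only needs a handful of running sums, so the data can be read in chunks and never has to be in memory at once. The sketch below is an assumption-laden illustration (simulated data, hypothetical function name, arbitrary chunk size), not a description of how Dask or cuDF work internally. The sum-based formula can also lose precision when values are very large relative to their spread.

```python
import numpy as np

def streaming_pearson(chunks):
    """Pearson r from an iterable of (x_chunk, y_chunk) pairs, in a single pass."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in chunks:
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        n += x.size
        sum_x += x.sum()
        sum_y += y.sum()
        sum_xx += np.dot(x, x)
        sum_yy += np.dot(y, y)
        sum_xy += np.dot(x, y)
    # Centered sums recovered from the raw running sums.
    cov = sum_xy - sum_x * sum_y / n
    var_x = sum_xx - sum_x ** 2 / n
    var_y = sum_yy - sum_y ** 2 / n
    return float(cov / np.sqrt(var_x * var_y))

# Simulated user data: hours online decrease slightly with age.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=100_000)
hours_online = 6.0 - 0.05 * age + rng.normal(scale=1.0, size=100_000)

# Feed the data in chunks of 10,000 rows, as if read from disk piece by piece.
chunk_size = 10_000
chunks = (
    (age[i:i + chunk_size], hours_online[i:i + chunk_size])
    for i in range(0, age.size, chunk_size)
)
r_streaming = streaming_pearson(chunks)
r_full = float(np.corrcoef(age, hours_online)[0, 1])

print(f"Chunked: {r_streaming:.6f}")
print(f"In-memory: {r_full:.6f}")
```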
- Interactive Visualization
A practical improvement would be to generate dynamic visualizations to explore correlations.
Suggested tools:
- Plotly: For interactive charts that allow you to select specific ranges.
- Dash: Create dashboards where the user can explore correlations in real time.
Example:
Show how the correlation between "age" and "hours of exercise" changes when filtering by gender or region.
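Before building a dashboard, the effect of filtering can be previewed with a group-wise calculation. The sketch below uses a small invented table with a hypothetical `region` column; the values are chosen so that the overall correlation and the per-region correlations differ clearly.

```python
import pandas as pd

# Hypothetical data: the age/exercise relationship differs by region.
people = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "age": [20, 30, 40, 50, 20, 30, 40, 50],
    "exercise_hours": [8, 6, 4, 2, 3, 3, 4, 4],
})

# Correlation over the whole table.
overall_r = people["age"].corr(people["exercise_hours"])

# Correlation within each region, which is what a dashboard filter would show.
by_region = {
    region: group["age"].corr(group["exercise_hours"])
    for region, group in people.groupby("region")
}

print(f"Overall: {overall_r:.2f}")
for region, r in by_region.items():
    print(f"Region {region}: {r:.2f}")
```

In this invented sample the overall coefficient is moderately negative while one region shows a perfect negative relationship and the other a positive one, which is the kind of pattern an interactive filter makes visible.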
- Correlations in AI Embeddings
If we are in the context of AI, we can work with embeddings generated by models such as BERT or ResNet.
Case Example:
Compare phrase embeddings in different languages to measure semantic similarities.
Metric: Cosine similarity.
Python: Using Hugging Face Transformers to generate embeddings and calculate correlations.
The following sections develop three of these points with practical, detailed examples.
- Categorical Correlations: A Practical Case with Mutual Information
Scenario:
You have data from a study on music listening habits:
- Gender: Male/Female.
- Preferred Musical Genre: Classical, Pop, Rock.
- Average hours of listening per week.
Aim:
Determine whether the preferred musical genre is related to the person's gender using mutual information.
Python implementation:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mutual_info_score
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = {
    "Gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "Musical Genre": ["Rock", "Classical", "Pop", "Rock", "Classical", "Pop"],
    "Listening Hours": [10, 15, 8, 12, 20, 9],
}
df = pd.DataFrame(data)

# Encode the categorical variables as integer labels
encoder = LabelEncoder()
df["Gender_Coded"] = encoder.fit_transform(df["Gender"])
df["Musical_Genre_Coded"] = encoder.fit_transform(df["Musical Genre"])

# Calculate the mutual information (scikit-learn reports it in nats)
mutual_info = mutual_info_score(df["Gender_Coded"], df["Musical_Genre_Coded"])
print(f"Mutual information between Gender and Musical Genre: {mutual_info:.2f}")

# Display the distribution of musical genres within each gender
sns.countplot(data=df, x="Gender", hue="Musical Genre")
plt.title("Distribution of Musical Genre by Gender")
plt.show()
```
Results:
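- Mutual information: A value of 0 indicates that the variables are independent in the sample; larger values indicate stronger dependence. The score is measured in nats and is not capped at 1, so it is easiest to interpret relative to the entropy of the variables or after normalization. For the six sample rows above, the score is about 0.46.
- Chart: The chart shows the distribution of preferred musical genres by gender.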
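To make the score less of a black box, the mutual information of the six sample rows can be computed directly from its definition, I(X;Y) = Σ p(x,y) · log(p(x,y) / (p(x)·p(y))). The sketch below uses only pandas and NumPy; dividing by the smaller of the two entropies is one of several possible normalizations and is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Female", "Male", "Female", "Female", "Male"], name="gender")
genre = pd.Series(["Rock", "Classical", "Pop", "Rock", "Classical", "Pop"], name="genre")

# Joint probability table p(x, y) and its marginals p(x), p(y).
joint = pd.crosstab(gender, genre, normalize=True).to_numpy()
p_gender = joint.sum(axis=1, keepdims=True)
p_genre = joint.sum(axis=0, keepdims=True)
expected = p_gender @ p_genre  # p(x) * p(y) under independence

# Sum p(x, y) * log(p(x, y) / (p(x) p(y))) over the cells that actually occur.
nonzero = joint > 0
mutual_info = float(np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])))

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Normalize by the smaller entropy so the result falls between 0 and 1.
normalized = mutual_info / min(entropy(p_gender.ravel()), entropy(p_genre.ravel()))

print(f"Mutual information (nats): {mutual_info:.4f}")
print(f"Normalized: {normalized:.4f}")
```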
- Massive Correlations with Dask
Scenario:
A dataset with millions of user records:
- Age.
- Hours on social media.
- Online shopping expenses.
Aim:
Efficiently calculate correlations using Dask, designed for big data.
Python implementation:
```python
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a mock DataFrame with Dask (5 values repeated 100,000 times = 500,000 rows)
data = {
    "Age": [25, 30, 35, 40, 45] * 100000,
    "Hours on Social Networks": [5, 3, 2, 4, 1] * 100000,
    "Purchase Expenses": [200, 150, 100, 250, 50] * 100000,
}
df = dd.from_pandas(pd.DataFrame(data), npartitions=10)

# Calculate the correlation; compute() triggers the evaluation across partitions
correlation_matrix = df.corr().compute()

# Show results
print("Correlation Matrix:")
print(correlation_matrix)

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap (Dask)")
plt.show()
```
Results:
- Fast correlations: Dask splits the data into partitions and processes them in parallel, so the approach remains practical for datasets with millions of rows.
- Clear visualization: The heatmap identifies the strongest relationships.
- Correlation in AI Embeddings
Scenario:
Compare sentences to measure semantic similarity. For example:
- "The sky is blue."
- "El cielo es azul."
- "The food is delicious."
Aim:
Measure similarities between sentences in different languages using embeddings and cosine similarity.
Python implementation:
```python
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Embeddings model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example sentences
phrases = ["The sky is blue.", "El cielo es azul.", "The food is delicious."]

# Generate embeddings
def get_embeddings(phrases):
    inputs = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average over real tokens only, using the attention mask to ignore padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

embeddings = get_embeddings(phrases)

# Calculate similarities
similarities = cosine_similarity(embeddings.numpy())
print("Similarity Matrix (Cosine):")
print(similarities)
```
Results:
- Similarity matrix: Values close to 1 indicate high similarity. The figures below are illustrative, not measured output:
- "The sky is blue." and "El cielo es azul." → 0.95 (high similarity).
- "The sky is blue." and "The food is delicious." → 0.30 (low similarity).
- Model choice: The actual values depend on the model. all-MiniLM-L6-v2 is trained mainly on English text, so a multilingual sentence-embedding model is a better fit for cross-language pairs.
- Applications: Useful for translation, semantic search or text analysis.
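The semantic search application reduces to ranking stored embeddings by their cosine similarity to a query embedding. The sketch below uses tiny invented 3-dimensional vectors in place of real model output, so the names and numbers are purely hypothetical; with real embeddings the same normalization and matrix product apply.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for real model output.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # document 0
    [0.1, 0.9, 0.1],   # document 1
    [0.0, 0.2, 0.9],   # document 2
])
query = np.array([0.8, 0.2, 0.1])

def normalize(m):
    """Scale vectors to unit length so that dot products equal cosine similarities."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# One matrix-vector product gives the cosine similarity of the query to every document.
scores = normalize(doc_embeddings) @ normalize(query)

# Indices of the documents, most similar first.
ranking = np.argsort(-scores)

for idx in ranking:
    print(f"document {idx}: similarity {scores[idx]:.3f}")
```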


