Real-World Comparison of Artificial Intelligence (AI) Large Language Models (LLMs) on OCI

Introduction

Oracle Cloud has several Large Language Models available within OCI itself that can be used by your applications. All usage is charged through OCI, which effectively makes this LLM-as-a-Service. The most important feature, in our view, is that all data movement is contained within OCI and no data is shared with the provider of the LLM. Data stays sovereign to your OCI region if the LLM is available there; otherwise it moves only between your region and the region hosting the LLM. If you are using company confidential information with LLMs to gain valuable insights, you want that data kept private, and OCI ensures that is the case.

Comparing Large Language Models (LLMs)

LLMs on OCI

Oracle has the following LLMs on OCI, although they are not available in all regions. For the examples below, we are running our workloads in OCI Sydney and using the LLM from London, United Kingdom. This means that data travels from SYD to LON and back again.

Cohere Models

Cohere Command R (08-2024)
Cohere Command R+ (08-2024)
Cohere Command A (03-2025)

Meta Models

Meta Llama 4 Maverick
Meta Llama 4 Scout
Meta Llama 3.3 (70B)
Meta Llama 3.2 90B Vision
Meta Llama 3.2 11B Vision
Meta Llama 3.1 (405B)
Meta Llama 3.1 (70B)
Meta Llama 3 (70B)

xAI Models

xAI Grok 3
xAI Grok 3 Mini
xAI Grok 3 Fast
xAI Grok 3 Mini Fast

Embed Models

Cohere Embed English Image 3
Cohere Embed English Light Image 3
Cohere Embed Multilingual Image 3
Cohere Embed Multilingual Light Image 3
Cohere Embed English 3
Cohere Embed English Light 3
Cohere Embed Multilingual 3
Cohere Embed Multilingual Light 3

Rerank Model

Cohere Rerank 3.5 (New)

That is a lot of models, and they do change reasonably frequently. Different models specialise in different use cases. You should do research and testing to determine what model is right for your use case.

Comparing Cohere LLMs

For the exercise below we analyse survey results, using the three Cohere models to illustrate some of the differences that occur.

Model	Description	Maximum Input Length	Maximum Output Length	Cost (A$ per 10,000 characters)
command-r-08-2024	command-r-08-2024 is an update of the Command R model, delivered in August 2024.	128k	4k	$0.00135
command-r-plus-08-2024	command-r-plus-08-2024 is an update of the Command R+ model, delivered in August 2024.	128k	4k	$0.0234
command-a-03-2025	Command A is our most performant model to date, excelling at tool use, agents, retrieval augmented generation (RAG), and multilingual use cases. Command A has a context length of 256K, only requires two GPUs to run, and has 150% higher throughput compared to Command R+ 08-2024.	256k	8k	$0.0234

Note that 'Cost' is the charge per 10,000 characters, counting both the characters sent to the LLM and the characters generated by the model. In our case, the AI ingests our survey results, which are approximately 16,100 characters, and the length of the result varied but was generally 1,000 to 2,500 characters. The estimated cost for the Command R model is ((16,100 + 1,000) / 10,000) * 0.00135, or approximately $0.00231 per invocation, so running this query 100 times would cost $0.23085. As we are operating from OCI Sydney, all charges stated here are in Australian Dollars.

For the more expensive and comprehensive models, the cost is ((16,100 + 2,500) / 10,000) * 0.0234, or approximately $0.043524 per invocation. That is a vast difference in cost (approximately 19 times), which may have merit in determining which model to use for which use case.
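The two estimates above follow the same arithmetic, which can be sketched in Python as follows. The function and dictionary names are our own illustration and not part of any OCI SDK; the rates are the per-10,000-character prices from the table, and the character counts are the ones from our survey example.

```python
# Price in A$ per 10,000 characters (input + output), taken from the table above.
RATE_PER_10K_CHARS = {
    "command-r-08-2024": 0.00135,
    "command-r-plus-08-2024": 0.0234,
    "command-a-03-2025": 0.0234,
}


def invocation_cost(model: str, input_chars: int, output_chars: int) -> float:
    """Estimate the charge for one prompt: (input + output) / 10,000 * rate."""
    return (input_chars + output_chars) / 10_000 * RATE_PER_10K_CHARS[model]


if __name__ == "__main__":
    cheap = invocation_cost("command-r-08-2024", 16_100, 1_000)
    costly = invocation_cost("command-r-plus-08-2024", 16_100, 2_500)
    print(f"Command R : A${cheap:.7f} per invocation")
    print(f"Command R+: A${costly:.6f} per invocation")
    print(f"Ratio     : {costly / cheap:.1f}x")
```

Because the output length differs between models, the per-invocation ratio (about 19 times) is slightly higher than the ratio of the two list prices alone.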

The Use Case

Pebble IT has a demonstration Apex application titled "Worker Management System" (WMS) that is run for a fictitious company, "Friendly Contracting Services" (FCS). Its primary role is to record all contractor details, including their insurance requirements, so that they have the necessary coverage for performing their work. In addition, the application surveys contractors on their satisfaction working for FCS and captures any safety or other concerns that management should be aware of.

Whilst survey results can be viewed individually and summarised numerically, the free-text responses are very difficult for people to examine, particularly when there are hundreds of survey results. We are using AI combined with document generation to build a monthly management report that summarises the results and also highlights anything of significance. Two major steps are performed at the request of the user:

Perform multiple requests of the AI to answer questions about our survey results and store the responses in the database
Build a Microsoft Word report based on the findings so that it may be distributed to HR and senior management on a periodic basis.

Typically this would be a monthly process, so given the costs above, using the more expensive models adds a near-negligible cost, as the report is likely to be run only 12-20 times per year. The report itself requires approximately 20 AI prompts, so the annual cost can be estimated as 20 reports x 20 prompts x ((16,100 + 2,500) / 10,000) * A$0.0234, or approximately A$17.41 per year. That is a very reasonable cost for what is a very low, but valuable, use of AI.
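The annual estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic stated above; the function name and parameter names are our own, and the figures are the upper-end assumptions used in the text (20 reports, 20 prompts each, 2,500 characters of output).

```python
def annual_report_cost(
    reports_per_year: int,
    prompts_per_report: int,
    input_chars: int,
    output_chars: int,
    rate_per_10k_chars: float,
) -> float:
    """Annual spend = reports x prompts x cost of a single prompt."""
    cost_per_prompt = (input_chars + output_chars) / 10_000 * rate_per_10k_chars
    return reports_per_year * prompts_per_report * cost_per_prompt


if __name__ == "__main__":
    yearly = annual_report_cost(
        reports_per_year=20,
        prompts_per_report=20,
        input_chars=16_100,
        output_chars=2_500,
        rate_per_10k_chars=0.0234,  # A$ rate for Command R+ / Command A
    )
    print(f"Estimated annual cost: A${yearly:.2f}")
```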

Note that WMS is not a real application that we sell to clients; it is a theoretical demonstration application that shows Apex UI, workflow, document generation and now AI capabilities.

The Results

We will share with you the summarisation section from the report that asks the following question: "Can you give me any trends regarding stress in my workforce?". The results are quite different across the 3 models and this illustrates the point that the differences in model deployed can be significant.

Cohere Command R

Workforce Stress Survey Summary

Based on the survey responses, there are some notable trends regarding stress in your workforce:

High Perceived Job Stress: A significant portion of your workforce (approximately 40%) feels that their jobs are more stressful than they had initially imagined. This is indicated by the responses to the question, "My job is more stressful than I had ever imagined", where a large number of respondents selected "Somewhat agree" or "Strongly agree."
Discrepancy Between Stress and Job Quality: There seems to be a discrepancy between the perceived stress levels and the actual job quality. While many employees feel their jobs are stressful, a considerable number also agree that they use many of their skills and abilities at work. This suggests that the stress may not necessarily be due to a lack of job satisfaction or engagement but could be influenced by other factors such as workload, work environment, or external pressures.
Understanding of Health and Safety: The responses also indicate that a majority of employees are clear about their rights and responsibilities regarding workplace health and safety. However, there is a slight discrepancy in the responses to the question, "I am clear about my rights and responsibilities in relation to workplace health and safety", with some employees selecting "Neither agree nor disagree." This could suggest that while most employees are aware of their rights, there may be a need for further education or reinforcement to ensure a consistent understanding across the workforce.
Unfamiliar Work Tasks and Stress: The survey data also highlights a potential correlation between job stress and the frequency of performing unfamiliar work tasks or using unfamiliar work methods. Employees who reported performing such tasks "Every month" or "Every week" were more likely to agree that their jobs were more stressful than imagined. This could indicate that providing adequate training, support, and resources for employees to navigate unfamiliar work situations could help mitigate job stress.
Flexible Work Arrangements: Finally, the survey responses suggest that a significant number of employees work in a flexible environment, either working from home or traveling between multiple worksites. While this flexibility can offer benefits, it may also contribute to job stress, especially if employees feel they lack the necessary support or resources to manage their work effectively in these environments.

Overall, the survey responses indicate that job stress is a prevalent issue within your workforce. To address this, it may be beneficial to implement strategies that focus on workload management, providing adequate resources and support for employees, and ensuring a healthy work-life balance. Additionally, regular check-ins and open communication channels can help identify and address stress-related concerns promptly.

Cohere Command R+

Workforce Stress Survey Summary

Based on the survey responses, there are some interesting trends and insights related to stress in your workforce:

Stress Levels: The majority of respondents seem to agree or somewhat agree that their job is more stressful than they had imagined. This indicates a general perception of higher stress levels among the workforce.
Role and Stress: The responses suggest that workers and supervisors/managers are more likely to perceive their jobs as stressful. This could be due to the nature of their roles and the responsibilities they carry.
Industry and Stress: The data does not provide a clear correlation between industry and stress levels. However, it is worth noting that respondents from the "Information Media and Telecommunications" industry consistently reported higher stress levels.
Workplace Size and Stress: There is no significant pattern between workplace size and stress. However, it is notable that employees in organizations with 20-199 workers seem to have varying stress levels, with some agreeing and others disagreeing with the stress statement.
Work Structure and Stress: The survey responses do not indicate a direct relationship between work structure and stress. However, it is worth exploring further as remote work and flexible work arrangements may have different stress factors compared to traditional office settings.
Frequency of Unfamiliar Tasks: Interestingly, employees who perform unfamiliar tasks more frequently (every week or every day) did not report higher stress levels. This might suggest that regular exposure to new tasks does not necessarily equate to increased stress.

Here are some recommendations based on these insights:

Address Stress in Specific Roles: Focus on stress management and support for workers and supervisors/managers, as they seem to be more affected by stress. Provide targeted training and resources to help them cope with job-related stress.
Industry-Specific Interventions: Consider industry-specific stress management programs, especially for the "Information Media and Telecommunications" sector, as employees in this industry consistently reported higher stress levels.
Regular Check-Ins: Implement regular check-ins and feedback sessions to monitor stress levels and address concerns early. This can help identify stress hotspots and allow for timely interventions.
Flexible Work Arrangements: Further research the impact of flexible work arrangements on stress levels. Some employees may benefit from flexible work, while others might prefer more structured environments. Tailor work arrangements to individual needs.
Training and Support for Unfamiliar Tasks: Provide adequate training and support for employees when introducing new tasks or work methods. This can help reduce stress associated with unfamiliar work.

By understanding these trends and implementing targeted interventions, you can work towards creating a healthier and more supportive work environment for your employees.

Cohere Command A

Workforce Stress Survey Summary

Based on the survey responses, here are the key trends regarding stress in your workforce:

Stress Levels Vary by Employment Type:
- Casual Workers (e.g., LANITA.DELOYE, NORBERTO.FEBRE, AMIT.DASGUPTA) report Strongly agree or Agree to the statement "My job is more stressful than I had ever imagined."
- Permanent Workers (e.g., REIKO.WHITE, RAMANO.GONZALEZ, LANCE.YEAKEL) show more mixed responses, ranging from Disagree to Somewhat agree.
Stress and Job Role:
- Workers (e.g., RAMANO.GONZALEZ, LANELL.FLORCZAK) tend to report higher stress levels (Somewhat agree) compared to Supervisors/Managers (e.g., REIKO.WHITE, LANCE.YEAKEL), who show more neutral or disagreeing responses.
Stress and Industry:
- Employees in Information Media and Telecommunications (e.g., LANITA.DELOYE, NORBERTO.FEBRE, AMIT.DASGUPTA) consistently report higher stress levels (Strongly agree).
- Those in Financial and Insurance Services (e.g., INDU.SHARMA, NORENE.PENSKY, NOVELLA.SHACKLETON) show moderate stress levels (Somewhat agree).
Stress and Work Location:
- Employees who work flexibly (e.g., IVANA.HORVAT, ABENI.BELO) report moderate stress levels (Agree).
- Those who only work at home (e.g., RAMANO.GONZALEZ, LANELL.FLORCZAK) also report moderate stress levels (Somewhat agree).
Stress and Business Size:
- Employees in organizations with 20-199 workers (e.g., LANITA.DELOYE, NORBERTO.FEBRE, IVANA.HORVAT) show higher stress levels compared to smaller organizations (e.g., RAMANO.GONZALEZ, LANELL.FLORCZAK in 1-4 workers).
Stress and Frequency of Unfamiliar Tasks:
- Employees who perform unfamiliar tasks every month (e.g., LANITA.DELOYE, NORBERTO.FEBRE, AMIT.DASGUPTA) report higher stress levels (Strongly agree).
- Those who perform unfamiliar tasks every 3 months or less frequently (e.g., INDU.SHARMA, NORENE.PENSKY) report lower stress levels (Somewhat agree).

Actionable Recommendations:

Targeted Support for Casual Workers: Implement stress management programs specifically for casual workers, who report higher stress levels.
Industry-Specific Interventions: Focus on reducing stress in high-stress industries like Information Media and Telecommunications through tailored wellness initiatives.
Flexible Work Arrangements: Monitor stress levels for employees working flexibly and ensure they have adequate support to manage their workload.
Training for Unfamiliar Tasks: Provide regular training to reduce the frequency of unfamiliar tasks, especially for employees performing them monthly.
Business Size Considerations: Address stress in medium-sized organizations (20-199 workers) by enhancing resources and support systems.

These insights can help you develop targeted strategies to mitigate stress and improve overall job satisfaction in your workforce.

Key Differences Between The Models

Command R

We feel that "Command R" would be fine for testing purposes, but it provides insufficient insight for production-level data. Using this model, it is quite likely that we would miss important points and have to rely on human scrutiny of the results instead.

Command R+

This is a big improvement, and at 19x the cost, it definitely should be. This gives us real insights and introduces a set of recommendations. The AI has indicated to us that there are some stress issues within our contractor workforce, and that would be sufficient justification to do a detailed analysis of the results so that a remediation plan can be created and executed. Whilst you may form your own opinion on how truly useful this insight may be, we believe that with more comprehensive prompts (we acknowledge that ours is deliberately very simple), even further insight could be achieved to deliver greater value.

The important point is that this model is likely to be seen as production-ready whereas the Command R model is not.

Command A

Another leap forward, but at little incremental cost (the result is larger, so a slightly higher cost is incurred). We were pleasantly surprised by this. You may wonder whether the AI might publish confidential details here, and it is definitely capable of that: if a contractor shared confidential details in a survey, they could be surfaced in a summary report, particularly if they contained criminal or graphic descriptions. The AI took it upon itself to name key people for particular classifications. Note that all the names you read here are from our AI-generated data set and do not relate to real people, and the survey results are also manufactured, not copied from real results elsewhere.

The information provided is both insightful and actionable. This is the level of information we would want in our survey results report. However, because of the detailed analysis it provides, its output requires human moderation. This is not a set-and-forget exercise in which you would hand over 100% trust to AI. The value to the staff member responsible is a thorough first examination of the results, which lets them gold-plate the subsequent report in far less time than a completely manual exercise would take.

Conclusions

When we started experimenting with the different models, we were not expecting such stark results. Hence this article. After experimenting, we were drawn to some points worth discussing further:

Models suitable for Testing versus Production
Estimating Costs
Human Oversight

AI and LLMs are the domain of highly specialised experts; however, the vendors are making these models and capabilities available to the world, who then implement them in their own way with their own use cases. Given the power of these models and the capabilities of AI, this is going to have serious consequences as we all make assumptions about what we read, hear and see. Not all models are equal, and we believe we have illustrated how stark the differences can be. This article is the very simplest of comparisons; its intention is awareness, as opposed to any sort of measurement model or decision-making tree.

Testing versus Production

Managing cost is the primary driver here. It is desirable that Development and System Test environments use cheaper models that are closely related to the more expensive Production model, so that you can verify that the model and AI have been correctly implemented. Production, UAT and Training environments should use the comprehensive models to show the highest quality outcomes that are desired.
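One simple way to apply this is a per-environment model lookup, sketched below in Python. The environment names and the choice of Command A for the production-like environments are our own assumptions for illustration; the model identifiers are the ones from the comparison table.

```python
# Hypothetical mapping of deployment environment to Cohere model.
# Cheaper Command R for Dev and System Test; the comprehensive model elsewhere.
MODEL_BY_ENVIRONMENT = {
    "dev": "command-r-08-2024",
    "system-test": "command-r-08-2024",
    "uat": "command-a-03-2025",
    "training": "command-a-03-2025",
    "production": "command-a-03-2025",
}


def model_for(environment: str) -> str:
    """Return the model configured for an environment, failing loudly if unknown."""
    key = environment.strip().lower()
    if key not in MODEL_BY_ENVIRONMENT:
        raise ValueError(f"No model configured for environment: {environment!r}")
    return MODEL_BY_ENVIRONMENT[key]


if __name__ == "__main__":
    for env in ("dev", "production"):
        print(env, "->", model_for(env))
```

Raising an error for an unknown environment, rather than silently falling back to a default, avoids an expensive model being used somewhere it was never budgeted for.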

Estimating Costs

The cost of using LLMs can be estimated. It is the quantity of invocations multiplied by (input + output), priced at the model's rate. The key is to understand all three of those metrics. Non-Production environments also need to be considered: your business may incur significant model and AI costs in your Dev/Test/UAT/Train environments, and these should not be ignored. Their usage should be included in all budget estimates. Application stakeholders and business analysts will be your primary source for the quantity of invocations, whereas the input and output sizes are more likely to be the domain of the developers. Together, they can calculate the costs.

AI and LLMs are yet another source of variable spend that can easily spiral out of control, much like Cloud computing costs did in the early years from 2015 onwards. No doubt Oracle will release more cost control options, but until that occurs, quality budgetary calculations will be very important. To illustrate the point, a Virtual Machine from Oracle may cost $90 per month, whereas running 2,000 invocations per day against a Cohere Command A model, averaging 10,000 characters of input + output each, comes to roughly $1,450 over a 31-day month (2,000 x 31 x $0.0234), far more significant than the cost of the VM.
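A budget of this kind can be sketched as a small Python calculation. The class and field names are our own, and the Dev/Test volume is a made-up placeholder; the production line treats the 2,000 invocations as a daily figure over a 31-day month, which is the reading under which the $1,450 figure works out.

```python
from dataclasses import dataclass

VM_MONTHLY_COST = 90.00  # A$ per month, the VM figure quoted above


@dataclass
class Workload:
    environment: str
    invocations_per_month: int
    avg_chars_per_invocation: int  # input + output characters
    rate_per_10k_chars: float      # A$ per 10,000 characters

    def monthly_cost(self) -> float:
        """Quantity of invocations x (input + output) / 10,000 x rate."""
        return (self.invocations_per_month
                * self.avg_chars_per_invocation / 10_000
                * self.rate_per_10k_chars)


workloads = [
    # Assumed: 2,000 invocations per day for 31 days on Command A.
    Workload("production", 2_000 * 31, 10_000, 0.0234),
    # Hypothetical non-Production volume on the cheaper Command R.
    Workload("dev/test", 5_000, 10_000, 0.00135),
]


def total_monthly_cost(items) -> float:
    """Sum the monthly LLM spend across all environments."""
    return sum(w.monthly_cost() for w in items)


if __name__ == "__main__":
    for w in workloads:
        print(f"{w.environment:<12} A${w.monthly_cost():,.2f}")
    print(f"{'total':<12} A${total_monthly_cost(workloads):,.2f}")
    print(f"{'one VM':<12} A${VM_MONTHLY_COST:,.2f}")
```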

Human Oversight

We are at an AI dawn (I express it this way because there may be more than one as we approach AGI), and there are many differing opinions about jobs being replaced and new jobs being created as a result of AI. What is clear is that AI is a new type of computing paradigm that is less structured than what we are traditionally used to. We have not had a lot of time to adjust, and change is happening too fast for comfort. We are comfortable with 'deterministic' programs, and we test them expecting that the same inputs will produce the same outputs. Payroll is a good example: a structured program that takes a number of inputs and gives the same output each time, which can be measured across different systems, typically by performing parallel pay runs on an old payroll versus a new payroll. AI is 'non-deterministic': whilst it achieves similar results each time, they will not be exactly the same each time it is invoked with the same inputs. You will note that most AI systems, like 'Perplexity.ai', have a "regenerate" button to be used if you do not like the answer you are provided.
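The contrast can be shown with a toy Python example. This is a deliberate simplification and not how any particular model is implemented: the pay calculation stands in for a deterministic program, while the weighted random draw stands in for the way a language model samples its next word from a probability distribution. All names and probabilities here are made up for illustration.

```python
import random
from typing import Mapping


def gross_pay(hours: float, hourly_rate: float) -> float:
    """Deterministic: the same inputs always give the same output."""
    return round(hours * hourly_rate, 2)


def sample_word(candidates: Mapping[str, float], rng: random.Random) -> str:
    """Non-deterministic: draw one word at random, weighted by its probability."""
    words = list(candidates)
    weights = list(candidates.values())
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Two "parallel pay runs" with identical inputs always agree.
    print(gross_pay(38, 45.50), gross_pay(38, 45.50))

    # Made-up probabilities for the next word of a summary.
    next_word = {"stress": 0.5, "workload": 0.3, "safety": 0.2}
    rng = random.Random()  # unseeded, so each run of the script can differ
    print([sample_word(next_word, rng) for _ in range(5)])
```

Running the script twice prints the same pay figures both times, while the list of sampled words will usually differ between runs.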

It is because of this unpredictability that human oversight is most likely required in the vast majority of use cases. In our simple example, I would definitely want to verify any extreme results that have been entered into our surveys. For example, an aggrieved contractor might post profanities or significant lies that would cause concern if included in a report sent to senior managers around the organisation. We need to understand that the value of AI lies in being a tool of efficiency for our workers, not in being the worker. I trust humans, and whilst I will happily use AI, I will not trust it completely. Expert prompts can minimise the occurrence of hallucinations by requesting the AI to use only the data provided and not to make inferences that are not present in the data. What AI does not have is good judgement. If I were responsible for the monthly survey report, I would exclude the profanity-laden rant submitted by an irate contractor, whereas AI would likely treat this as a real example that needs to be highlighted and effectively skew the report.
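As an illustration of the kind of prompt described above, the sketch below wraps a question and the survey data in explicit grounding instructions. The wording of the instructions and the function name are our own assumptions, not a tested or recommended prompt, and the result would still need human review.

```python
# Hypothetical grounding instructions placed ahead of every question.
GROUNDING_RULES = (
    "Answer using only the survey data provided below. "
    "Do not infer or assume anything that is not present in the data. "
    "If the data does not answer the question, say so."
)


def build_prompt(question: str, survey_data: str) -> str:
    """Combine the grounding rules, the survey data and the user's question."""
    return (
        f"{GROUNDING_RULES}\n\n"
        f"Survey data:\n{survey_data.strip()}\n\n"
        f"Question: {question.strip()}"
    )


if __name__ == "__main__":
    prompt = build_prompt(
        "Can you give me any trends regarding stress in my workforce?",
        "respondent,statement,answer\n...",  # placeholder for the real survey export
    )
    print(prompt)
```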

Final Thoughts

There is a lot of thought and preparation that needs to be undertaken as part of adopting AI. Whilst it may seem obvious, selecting the right model is a task that needs to be included in every project, and stakeholders need to understand that different models can be employed for the same purpose in different environments to lower costs. The Oracle OCI capability of housing LLMs makes AI very accessible and is an excellent starting point for experimenting to understand what is possible. Oracle Apex is a great tool that has built-in capability to invoke AI quickly, so that money is spent on understanding the use of AI as opposed to building the framework to get to that understanding.

We have been keen observers of AI for the past 4 years and we have been very careful in our undertakings in this space. We have not jumped in with both feet, but have moved beyond dipping our toe in, and we believe we have insights to share and have good capability with AI in the context of Oracle Apex on OCI. Feel free to reach out to us here if you wish to discuss further.
