NEURAL NETWORKS, METRICS AND DATASETS FROM APPLE.
02/09/2025
Updated on 17/10/2025
In 2025 the neural network "fever" swept through corporations and drew them into a prolonged battle over market ratings and benchmarks.
Marketing aside, the main weapon of most market players remains the practical usefulness of their neural networks.
The recent triumph of Google Veo3 in video generation set the direction for content-generator development and gave the search giant a successful marketing springboard for the artistic neural network NANOBANANA, as well as many other effective projects.
At the same time, the largest niche player, OpenAI, has faced some uncertainty since the release of GPT-5. A noticeable reduction in usage limits made the real scope of OpenAI's capabilities clearer, and the marketing campaign for the new model showed that even the leader can make mistakes. Since the launch of ChatGPT (2022), the company had been at the peak of technology and served as an example for others.
A similar situation can now be observed with xAI, Anthropic and Perplexity AI: these companies occasionally slip into scandals and lose promising contracts, all while powerful Chinese competitors (Alibaba, DeepSeek, Tencent, Baidu, Moonshot and others) are right on their heels.
Almost absent from this entire story is the company that ranks third in the world by market capitalization.
Apple has confidently held its position in the premium smartphone market for many years. Demand for the company's wide product line has not decreased even today: users value Apple gadgets for their simplicity and security, and there is no sign of demand declining in the future either.
But technology moves forward, and although the Google Pixel 10 smartphone with its Tensor G5 processor did not post the best test results, it has confidently occupied the niche of smartphones with neural networks integrated in the form of voice control.
Nevertheless, Apple is not sitting idly by. We have already heard about the integration of ChatGPT into iOS applications and about an agreement with Alibaba, driven by the latter's large trade volume. Apple also plans to integrate Perplexity's search engine, whose already controversial Comet browser was released in July 2025, into internet search on its devices.
In addition, Apple has released a number of neural network models, focused mainly on visual tasks.
The company publishes its public models on the popular platform Hugging Face. There are currently over 140 models available.
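A quick way to browse that catalogue programmatically is the huggingface_hub client; the sketch below assumes the models are published under the organisation name "apple":

```python
# Enumerate Apple's public models on Hugging Face via the huggingface_hub client.
# Assumption: the organisation account is named "apple".
from huggingface_hub import list_models

apple_models = list(list_models(author="apple"))
print(len(apple_models))        # 140+ at the time of writing
for m in apple_models[:10]:
    print(m.id)                 # repository IDs of the first ten models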
By visiting the collections, we can see that the models are grouped into families:
1. FastVLM: models that convert an image into a set of compact visual tokens, reducing processing latency and speeding up response generation. They use the hybrid vision encoder FastViTHD, which decreases delays in processing visual information while maintaining high accuracy on visual understanding and text generation tasks.
2. MobileCLIP2: mobile-friendly models for linking images and text. They are trained so that an image and its description have close vector representations (embeddings), which lets them understand and find semantic similarities between images and text (a minimal image-text matching sketch follows this list).
3. DiffuCoder: diffusion-based large language models focused on code generation. Instead of the usual left-to-right text construction, they start from a “noisy” version of the code and gradually refine it, producing the final result in parallel across the entire code rather than token by token. This approach allows the models to form the structure of the code at once, rather than building it line by line, which is especially valuable for complex code.
4. AIMv2: large visual encoders with multimodal autoregressive pretraining, which generate image fragments and text tokens in a single sequence. This approach improves image-text interaction and achieves high accuracy on classification tasks in both images and text.
5. Core ML: visual models for object detection based on the YOLOv3 standard by Ultralytics and for depth estimation of images. In addition, coreml-FastViT is designed for object classification in images. All models run on the Core ML framework for on-device inference, executing directly on Apple hardware with automatic use of available CPU and GPU resources.
6. OpenELM Instruct Models (Efficient Language Models): transformer language models capable of understanding and following instructions. They are refined through instruction fine-tuning, which helps them better understand task formulations and make fewer errors when following specific directions. According to the release paper on arXiv, the development team provided not only the weights but also a full framework for training and evaluating the models on publicly available datasets, including training logs, several checkpoints, and pretraining configurations.
7. OpenELM Pretrained Models: the same transformer language models as the previous ones, trained on large text corpora (plain text without annotations) but not fine-tuned on instructions.
8. MobileCLIP Models + DataCompDR Data: multimodal models designed for efficient work with images and text on mobile devices. They deliver high performance with minimal resource requirements and can recognize object classes in images they have never seen. The image is first split by convolutional layers into small patches, e.g. 16×16 pixels, from which features (color, shape, texture) are extracted and converted into vectors. Each patch vector is assigned a positional embedding that encodes where the patch sits in the image, so the model keeps track of its location. In the transformer part of the network, patches interact with each other and form the overall context of the image, after which the patch vectors are combined by a linear layer into a single vector carrying that context. This context (the overall class feature) is what aligns the image vector with the vectors of texts of similar meaning. Thus the model understands that an image it has never seen can be described by the text whose context most closely matches it (a patch-embedding sketch also follows this list).
9. TiC-CLIP: models designed for continual learning on image-text data that accounts for the temporal evolution of the data via timestamps in the datasets. They can be efficiently fine-tuned on incoming data over time without full retraining. This collection also includes benchmarks for checking temporal stability, forgetting, and how quickly the networks adapt during fine-tuning.
10. DepthPro: specialized neural networks for high-precision generation of 3D depth maps (grayscale) from images. Depth maps are used in augmented reality, medical, and automotive technologies.
11. Core ML Stable Diffusion: diffusion models optimized for various devices under the Core ML framework, converting textual descriptions into images. They use a latent space (compressed feature map, not pixels) for image generation, allowing high detail with relatively low computational costs.
12. Core ML FastViT: hybrid networks combining convolutional layers with large kernels and structural reparameterization (several layers used during training are merged into one optimized layer). The multi-branch structure matters for training, while the merged form speeds up operation when deployed on devices.
13. Core ML Depth Anything: additional models for creating depth maps, in this case from a single image (AR and other applications for estimating distances to objects); a depth-estimation sketch follows this list.
14. DFN Models + Data: additional CLIP models trained on filtered data intended for image classification by text queries without further training. During their training, Data Filtering Networks were used – small neural networks automatically selecting high-quality image-text pairs from massive datasets. This collection expands the choice of models in terms of speed/accuracy.
15. AIM: additional autoregressive models for image classification, trained using unannotated data (without labels). With an increased number of parameters, they adapt well to various tasks.
16. DCLM: language models trained on a high-quality automatically annotated dataset.
17. Core ML Segment Anything 2: Meta's Segment Anything Model 2, adapted to the Core ML framework for object masking in images and video.
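To make the image-text alignment described in items 2 and 8 concrete, here is a minimal zero-shot matching sketch using the Hugging Face transformers CLIP API. The generic openai/clip-vit-base-patch32 checkpoint is used as a stand-in (the MobileCLIP checkpoints may need their own loading code), and photo.jpg is a placeholder path:

```python
# Minimal zero-shot image-text matching with a CLIP-style model.
# "openai/clip-vit-base-patch32" is a generic stand-in checkpoint;
# "photo.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
candidate_captions = ["a dog playing in the snow", "a bowl of fruit on a table"]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(candidate_captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```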
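The patch-to-context pipeline from item 8 can also be sketched in a few lines of PyTorch. This is an illustrative ViT-style stem with arbitrary sizes, not MobileCLIP's actual architecture:

```python
# Illustrative ViT-style stem: split an image into 16x16 patches with a
# convolution, add positional embeddings, mix patches with self-attention,
# and pool into a single context vector. Sizes are arbitrary.
import torch
import torch.nn as nn

patch, dim, img = 16, 256, 224
num_patches = (img // patch) ** 2                                  # 14 * 14 = 196

to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # patchify + project
pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))           # where each patch sits
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
to_context = nn.Linear(dim, dim)                                   # final linear projection

x = torch.randn(1, 3, img, img)                       # a dummy image batch
tokens = to_patches(x).flatten(2).transpose(1, 2)     # (1, 196, 256) patch vectors
tokens = tokens + pos_emb                             # add positional information
tokens = encoder(tokens)                              # patches exchange context
context = to_context(tokens.mean(dim=1))              # one vector for the whole image
print(context.shape)                                  # torch.Size([1, 256])
```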
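For the depth-map families in items 10 and 13, the Hugging Face depth-estimation pipeline offers a compact starting point. The checkpoint below is a community-hosted Depth Anything model used purely as an example; Apple's own Core ML packages are loaded differently, and photo.jpg is again a placeholder:

```python
# Single-image depth estimation via the Hugging Face pipeline.
# The checkpoint is an example Depth Anything model; "photo.jpg" is a placeholder.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")

image = Image.open("photo.jpg")
result = depth_estimator(image)

# result["depth"] is a PIL image (grayscale depth map);
# result["predicted_depth"] is the raw tensor
result["depth"].save("photo_depth.png")
print(result["predicted_depth"].shape)
```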
As we can see, Apple’s main focus is on neural networks working with ready-made images and real-time capture. There is a visible trend toward creating universal language models for processing both visual and textual data.
Among datasets, Apple offers large collections of annotated images, image/text pairs, collections of tasks for mathematical reasoning, a collection of cleaned language data, a text corpus (unannotated text), a collection of texts with high information density, a collection of synthetic captions, embeddings and metadata, and a collection of Apple stock financial data for 2025.
There is also a benchmark for testing the capabilities of neural network agents across various domains, useful for evaluating universal agents. At the moment, the Hugging Face dataset viewer does not allow downloading the benchmark because of a table error: some columns are missing, which prevents consolidating the data into a single format. However, the benchmark can be found in Apple's GitHub repositories, along with many other of the company's developments.
There you can also find, for example, the AXLearn library, which is built on the JAX machine learning ecosystem. JAX is a Python library for accelerated array computations and program transformations, developed by Google with contributions from Nvidia and other community members and designed for high-performance numerical computing and large-scale machine learning.
JAX is moving toward a modular design so that it can run on any processor. To that end, the JAX team has developed a unified device API, the PJRT runtime, and plans to extend the ecosystem with PJRT plugins adapted to specific devices. This approach keeps JAX universal: for device-specific functionality it simply calls the PJRT plugin corresponding to the device in use.
This is directly related to the PJRT plugin that Apple's team is developing to run JAX across all Apple devices. Apple also points to the acceleration of JAX on Mac platforms thanks to the new Metal plugin, which runs on OpenXLA, Google's open-source machine learning compiler. Given the experience of developing the Dart language as well as Flutter itself, such cooperation between Google and Apple looks promising, provided the rules are not violated.
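To see what this looks like in practice, here is a minimal sketch, assuming JAX and Apple's jax-metal PJRT plugin are installed on an Apple silicon Mac (package names and the exact device label may differ between releases):

```python
# Minimal check that JAX picked up a Metal (Apple GPU) backend, plus a tiny
# jitted-and-differentiated computation that runs through whichever PJRT
# backend is available. Assumes something like: pip install jax jax-metal
import jax
import jax.numpy as jnp

print(jax.devices())            # on Apple silicon with the Metal plugin,
                                # this should list a METAL device

@jax.jit                        # compiled via the active PJRT backend
def mse(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

grad_mse = jax.grad(mse)        # program transformation: automatic differentiation

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 16))
w = jnp.zeros((16,))
y = jax.random.normal(key, (128,))
print(mse(w, x, y), grad_mse(w, x, y).shape)
```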
The prospect of using the latest technologies for training and running neural networks may allow Apple to improve the architecture of its own neural networks, as well as optimize devices and the ecosystem by finally solving the problems of integration and fragmentation observed today in the field of machine learning.
In its discrete model FS-DFM (Few-Step Discrete Flow-Matching), developed together with researchers from The Ohio State University (USA), the company tested a flow-matching technique that makes the number of sampling steps an explicit parameter. The model generates tokens in parallel, updating the entire text in a single pass at each iteration; each iteration refines the whole sequence starting from noise, relying on the statistical patterns of word sequences learned from the training corpus.
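As a rough illustration of the underlying idea (parallel updates of the whole sequence, with the number of steps exposed as a parameter), here is a toy sketch; it is not Apple's FS-DFM algorithm, and the stand-in "model" simply returns random logits:

```python
# Toy sketch of parallel iterative refinement over a discrete token sequence.
# All positions are updated at every step, and the step count is an explicit
# parameter. This is NOT the FS-DFM algorithm itself; the "model" below is a
# placeholder that returns random logits instead of learned predictions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, num_steps = 50, 12, 8

def model_logits(tokens, step, num_steps):
    """Stand-in denoiser: a real model would condition on the noisy tokens
    and the current step to predict per-position token distributions."""
    return rng.normal(size=(len(tokens), vocab_size))

# start from pure noise: random tokens at every position
tokens = rng.integers(0, vocab_size, size=seq_len)

for step in range(num_steps):
    logits = model_logits(tokens, step, num_steps)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # update every position in parallel, rather than token by token
    tokens = np.array([rng.choice(vocab_size, p=p) for p in probs])

print(tokens)   # final sequence after num_steps parallel refinement passes
```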
Whatever Apple’s future plans may be, such news brings hope that a company with vast resources will make a tangible contribution to the development of neural networks in pursuit of the common goal – the creation of AGI.
One can wish Apple success in this direction, and its fans – patience.
Visit the SAID-Test to practice identifying fake generations.
said correspondent🌐
You can create a separate thread on the community forum.