Zhipu GLM-4.5, over the mountain.

Zhipu AI, Overcoming the Obstacles.
At the beginning of this year, the landscape of China's “Six Dragons of AI” was upended by DeepSeek and its open-source models, throwing the incumbents' momentum into disarray. At the time, the industry was largely bearish on the other AI startups, and Zhipu was on that list.
However, as we have seen, AI is far from reaching a stage where winners are decided. Even OpenAI could potentially be overtaken by Anthropic. Everything is changing rapidly.
Recently, companies like DeepSeek have become increasingly quiet. This is somewhat like watching a movie: audiences are forgetful, and no one will unconditionally support a particular director. Ultimately, it comes down to the quality of the work and the product itself.
Just now, Zhipu released its latest visual reasoning model, GLM-4.5V. You can give it a try—the results are quite impressive.
In terms of technical details, GLM-4.5V again uses a mixture-of-experts (MoE) architecture, with 106B total parameters and 12B active parameters. It achieves the best results among open-source models of its class on multimodal tasks such as image understanding, video analysis, GUI screen recognition, and document interpretation.
Below are the benchmark numbers released by Zhipu. As you can see, GLM-4.5V performs best among open-source models at its scale.
Even people in my WeChat Moments are saying that what GPT-5 didn't bring, Zhipu has now delivered.
Compared with text-only reasoning, visual reasoning models can understand visual information the way humans do. After all, in the real world most of the information we receive is not pure text; it comes in forms such as pictures, videos, interfaces, and charts.
Being able to understand these images and combine them with language and knowledge for reasoning is closer to the way humans think.
For example, my child is learning English over the summer vacation, and I want AI to help him learn the words in his textbook. In such a scenario the value of a visual reasoning model is clear: I only need to take a photo of the textbook, or let the AI see it via video call, and it can recognize the page content and practice vocabulary with the child.
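As a rough sketch of how that scenario could be scripted, here is a minimal example using Zhipu's Python SDK. The multimodal message format follows the SDK's GLM-4V conventions, and the model id "glm-4.5v" is my assumption; check the current API docs before relying on either.

```python
import base64
from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key="YOUR_API_KEY")

# The SDK's image_url field accepts either a URL or a base64 string,
# so we encode the photo of the textbook page directly.
with open("textbook_page.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id; verify against Zhipu's docs
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": img_b64}},
            {"type": "text", "text": "Read the vocabulary list on this page "
                                     "and quiz me on each word, one at a time."},
        ],
    }],
)
print(response.choices[0].message.content)
```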
At the beginning of last month, Zhipu open-sourced GLM-4.1V-9B-Thinking. Although this model has a relatively small number of parameters, its performance is already outstanding. After its release, it quickly topped the Hugging Face Trending chart.
Today's 4.5V carries over the GLM-4.1V-Thinking architecture and swaps the base model for Zhipu's latest GLM-4.5-Air.
Below is the technical report from the 4.1V release. If you are interested, you can search for it and read it; it covers many technical details, including the model architecture and training process.
Without further ado, let's take a look at two evaluations.
I used GLM-4.5V to do a page replication test. First, I took a screenshot of the Granola homepage, then uploaded the image directly to GLM-4.5V and entered the prompt: “Help me replicate this page.”
In less than a minute, the page was completely replicated. I recorded the entire process on video, which you can watch; it is not sped up, just the raw result.
Next, let's take a look at the cloning results. The left side is the actual homepage of Granola, and the right side is the page generated by GLM-4.5V. When compared side by side, they are almost identical, with high fidelity in both color and layout.
As you can see, GLM-4.5V's page cloning capabilities are very powerful, capable of nearly lossless restoration of the original page after uploading an image.
You can try it out and verify for yourself; the results are genuinely impressive. In a real-world scenario, a small team that loves the style of the Granola website could optimize the cloned version instead of starting from scratch.
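If you would rather script this test than use the web UI, here is a hedged sketch along the same lines, again assuming the zhipuai SDK and the "glm-4.5v" model id:

```python
import base64
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

with open("granola_homepage.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": screenshot_b64}},
            {"type": "text", "text": "Help me replicate this page. Return a "
                                     "single self-contained HTML file."},
        ],
    }],
)

# Save the reply so it can be opened in a browser; in practice you may
# need to strip surrounding Markdown code fences from the output first.
with open("clone.html", "w", encoding="utf-8") as f:
    f.write(response.choices[0].message.content)
```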
It's not just images—GLM-4.5V can also clone webpages from videos. I recorded a 15-second video of the Granola website's operations and uploaded it to GLM-4.5V.
The prompt is also very straightforward: “Help me generate the HTML code for the front-end web page shown in this video, including all clicks, jumps, and interactive operations. If image materials are needed, please find suitable image URLs online and use them directly. Do not use placeholders.”
In less than 3 minutes, GLM-4.5V built the entire front-end web page based on the video content. The interactions, layout, and images were all in place, faithfully reproducing the actual operations in the video.
Below is the reference video I uploaded (note: when recording, avoid scrolling too fast):
This is the page generated by GLM-4.5V in action.
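Scripting the video variant looks almost identical. The sketch below assumes the SDK accepts a "video_url" content block for this model, as it does for Zhipu's earlier video-capable models; treat that as unverified:

```python
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

prompt = (
    "Help me generate the HTML code for the front-end web page shown in this "
    "video, including all clicks, jumps, and interactive operations. If image "
    "materials are needed, please find suitable image URLs online and use them "
    "directly. Do not use placeholders."
)

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            # "video_url" follows the convention of Zhipu's video-capable
            # models; confirm GLM-4.5V accepts it before relying on this.
            {"type": "video_url",
             "video_url": {"url": "https://example.com/granola_demo.mp4"}},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(response.choices[0].message.content)
```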
Attentive readers may have noticed some subtle differences between the page generated by GLM-4.5V and the reference page, such as a different font.
GLM-4.5V supports online editing. Simply click the edit button in the top-right corner, select the module you want to adjust, describe the changes you want in natural language, and the system will automatically optimize the page for you. As shown in the GIF below:
In addition to online editing, you can also copy the code generated by 4.5V to your local machine for manual adjustments. Whether you choose automatic editing or manual modifications, it's very convenient. This is what true productivity improvement looks like.
Replicating other people's webpages is just the appetizer. Think about it: we could feed a designer's mockups directly to 4.5V and have it generate the webpage.
In addition to generating webpages from images or videos, GLM-4.5V also performs well in image recognition.
I uploaded a photo I had taken at the 798 Art District, and in less than a minute GLM-4.5V analyzed the building's structure and colors and correctly identified it as a building in the 798 Art District. The recognition speed and accuracy genuinely surprised me.
Based on the capabilities of 4.5V, we should also be able to create some interesting image applications.
For example, take a photo, identify the objects in the image, and display the corresponding English words. This could be an interesting English learning product. Another example would be to upload a video, let the model understand the content of the video, and then create corresponding text...
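To make the first idea concrete, here is a hedged sketch of such a flashcard loop. Asking the model to reply with machine-readable JSON is a prompt convention, not a documented output mode, and the model id remains an assumption:

```python
import base64
import json
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

def photo_to_flashcards(path: str) -> list[dict]:
    """Identify objects in a photo and return [{"object": ..., "english": ...}]."""
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="glm-4.5v",  # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": img_b64}},
                {"type": "text", "text": 'List the main objects in this photo '
                                         'as a JSON array of {"object", '
                                         '"english"} pairs. Reply with JSON '
                                         'only.'},
            ],
        }],
    )
    # If the model wraps the JSON in Markdown fences, strip them before parsing.
    return json.loads(response.choices[0].message.content)

for card in photo_to_flashcards("desk.jpg"):
    print(f"{card['object']} -> {card['english']}")
```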
Over the weekend, I chatted with a good friend from Zhipu. He said that after the Spring Festival this year, he felt very down. He was also worried that, as others had said, after DeepSeek, other startups would have no chance. Fortunately, he persevered.
Recently, since the release of GLM-4.5, he has started using his company's own models frequently in work scenarios such as coding and writing. Before that, even as a Zhipu veteran, he mostly reached for frontier models like GPT and Claude; the company is quite open about this and does not require employees to use its own products.
Now, the latest generation of models can reliably handle daily tasks. With the gap no longer as significant as before, he naturally prefers to use his team's own results. Especially after the release of 4.5V, he clearly felt that the model team was gradually regaining its own rhythm.
Zhipu was one of the first companies in China to enter the large-model race. Over the past two years the field has seen dramatic ups and downs, and the company has been both highly praised and widely questioned. Its early investments did not translate directly into advantages, and outsiders once believed it had been left behind by competitors. However, it remained in the game, focusing its efforts on refining the models.
From GLM-4.5 to 4.5V, their direction is clearer: continue to maintain an open-source approach while ensuring the model's stable deployment in real-world scenarios. This generation performs better in tasks such as long video analysis, cross-modal reasoning, GUI operations, and geolocation.
As my friend put it, benchmarks are just one dimension; what truly matters is reliability in real-world scenarios.
This time, it feels like Zhipu has finally made it over that mountain.


