How Much Content is NOT Indexed in Google?


Today I want to talk about an issue we found that is related to JavaScript and SEO, but isn't really about JavaScript itself. We found out that quite a lot of content is not indexed in Google, even though the websites involved are not what we would call JavaScript-powered websites. When I was preparing this presentation, quite a lot of people reached out to me with questions. I talked to Bastian Grimm – whom you probably know – and he said, "Oh no, you're going to be talking about JavaScript SEO – isn't that dying? Isn't that ending? Isn't that a little bit obsolete by now?" And I told him I'm going to make it really, really interesting for you, because we found some interesting issues to talk about. A while ago I actually recorded a video where I argued that JavaScript SEO is dying: once Google gets very, very good at rendering JavaScript, the whole need for JavaScript SEO will disappear.
And then, a few months back, I got invited to Zurich by Martin Splitt. Who of you knows who Martin Splitt is? Raise your hands. Okay, most of you. If you don't know: Martin Splitt is the person at Google responsible for dealing with people like me – people who have issues with JavaScript SEO on their websites. So Martin invited me to Zurich for Google Hangouts, and I actually had a lengthy conversation with both Martin Splitt and John Mueller. (I was pretty sure I didn't have John Mueller's picture in this presentation – I kind of forgot about this one.) I shared my opinion with both of them: JavaScript SEO is dying, because once you guys have more computing power to render it, it's not going to be needed anymore. And both John and Martin went on a seven-to-ten-minute monologue – we could call it a sales pitch for our services – saying that JavaScript SEO is going to be needed more and more. But that was all before we had all the interesting research. After this conversation we started to look into some of the interesting problems we found, and as we did, we realized that JavaScript SEO is a little bit more complex than we thought. It's not only websites like Hulu.com or Google Flights that still have massive issues with indexing their own content because of JavaScript. There are a lot of people in the community saying that JavaScript is evil – for a good reason, because it makes our work extremely complex and forces SEOs to do a lot of extra steps. I wouldn't say that JavaScript is evil; it is very, very complex, but there's a very good reason for that. The main problem with calling JavaScript evil is that most websites have by now moved from plain HTML to HTML with a lot of JavaScript, so it's going to be very difficult to find any website you work with that doesn't have JavaScript. And as I found out just yesterday night, even Sixt's website, sixt.de, has a very interesting JavaScript SEO issue – yet none of us would call sixt.de a JavaScript-powered website. It's something we would always refer to as plain HTML. Let me move forward with that.
One of the things the Googlers said in Zurich is that even HTML pages can get rendered, mostly because sometimes it's just cheaper for them – or not much more expensive than not rendering them. But we found that JavaScript is a problem even for non-JavaScript websites, and this is something I want to expand on today. As an SEO community we are used to what I call post-factum learning: basically, we wait until something breaks completely, there's a lot of drama in the community, and then everyone fixes it. The problem we're seeing now with JavaScript is that the change is happening, but it's happening very, very slowly. (This is a very complex slide to understand.) It's not something that's going to make any website drop massively overnight. And, as I said, JavaScript SEO has now evolved: from the obviously JavaScript-powered websites we would see before, to all the websites we work with. The irony behind all that – one of my favorite examples – is an article I've used for a year as an example of a very interesting idea. The content of the article is actually not important. What matters is that it's written by Googlers, about the cost of JavaScript, and it's published on Medium. Who of you would say that Medium is a JavaScript website? If you look at the comments underneath this article, they are not indexed in Google. For me, that was one of those extremely geeky jokes – I had a lot of laughs after I found this one. (Whoever of you laughed: you're a geek.) If we look at this very page, it has 500 referring domains, 2,000 backlinks, a tremendous amount of reads, claps, and whatever. It's a very popular piece of content that we would assume Google is going to crawl, render, and index pretty often. The problem is that this post was published more than a year ago, and Google still hasn't picked up the comments, because the whole comments section of Medium is powered by JavaScript. Now let's get into why exactly this happens. For a few years now we were thinking in terms of a time frame of JavaScript indexing – that we're waiting for two waves of indexing to happen, and all that. But it actually got a little bit more complex with some of the research, which I'm going to explain in a second. Long story short: we found that thousands of domains are not fully indexed even months after publishing the content. So even if you publish an amazing article today, it may happen that the URL gets indexed but the content is not indexed for a few months – or someone else outranks you for your own content, which we'll get to with the Sixt example in a second. Before all this, most of us would blame JavaScript. If you're going to blame anything, just blame rendering, because the problem has evolved towards rendering recently, and JavaScript itself is not as big a part of it. (That's the biggest logo I could squeeze onto a hoodie before going there, just FYI.) So let's talk about what the Googlers told me in Zurich, which was actually very, very interesting.
I asked them how rendering works at Google. (Sorry, this is small print.) Google looks at the difference between the initial HTML and the rendered version: they render the content and check whether anything changed. Now, if we think about Medium: say you publish an amazing article there. When Google compares the initial version of the article with the rendered version, there's going to be no difference – because there are no comments yet. And this is where we start to see a massive issue with the heuristics they use to decide whether your content relies on JavaScript. Martin said he hasn't fully grasped what triggers those heuristics – and that's not because Martin isn't good at this, he's an amazing guy. I'm guessing those heuristics somehow rely on machine learning and on signals that are simply not human-readable. So Google applies certain heuristics based on the difference between the rendered and non-rendered page – again, the Medium example. But I would say those heuristics are still in their infancy. They're still pretty new, Google is still playing with them and optimizing them – like the Google algorithm in 2006. You probably remember how easy those times were.
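Conceptually, that heuristic is just a diff between two snapshots of the same page. Here's a minimal sketch of the idea in Python, assuming you've already captured both the initial HTML and the rendered DOM (say, from a headless browser). The extraction is deliberately crude and the function names are mine – this is nothing like whatever Google actually runs:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.fragments.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return set(parser.fragments)

def js_only_content(raw_html, rendered_html):
    """Text that appears only after rendering, i.e. content relying on JavaScript."""
    return visible_text(rendered_html) - visible_text(raw_html)

raw = "<html><body><h1>Article</h1><p>Body text.</p></body></html>"
rendered = ("<html><body><h1>Article</h1><p>Body text.</p>"
            "<div>Great post!</div></body></html>")
print(js_only_content(raw, rendered))  # {'Great post!'}
```

If `js_only_content` comes back empty – as it would for a fresh Medium article with no comments yet – a comparison like this concludes the page doesn't rely on JavaScript, which is exactly the failure mode described above.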
And those heuristics are far from perfect. What Martin actually said is that all new websites get rendered, and this is extremely interesting, because from my point of view: okay, what's a new website? And the second problem I had: all of the experiments we did at Onely were based on new domains, new IPs and so on, so most of our experiments were kind of useless from this point of view. So what's a new website? If you relaunch your CMS – if you publish a brand-new sixt.de – is that a new website, or does it have to be a new domain? And what if a new website doesn't have any user-generated content? So we started playing with that. With a lot of clients, you would normally advise running an experiment on staging before launching a new CMS. Looking at how this is structured, that turns out to be a dead end: you can't really test a new CMS that way, because you're most likely going to use a new domain, and you can't really play with indexing inside your actual domain – I wouldn't recommend duplicating your content within sixt.de. So we decided to experiment with how good Google's heuristics really are, and we started by rerunning, in 2019, all the experiments from 2017, because we found out that Google got much better at indexing JavaScript. Long story short: we had three domains – obviously new IPs, new domains, content generated with Articoolo – and we compared the results with our 2017 experiments. In 2017 we had a page where the homepage would link down five or six levels deep with HTML links; with JavaScript-generated links, Google would only index the homepage, go one link deep, and give up. That page is actually still not indexed after two years. So we repeated that experiment with a lot of different domains, like jscrawling.party or htmlcrawling.wine – we went all crazy on the new TLDs; jscrawling.pizza is one of my favorites. We played with that for a bit, and all of those were indexed within literally minutes. For one of them we had to wait one day, but then it was indexed within minutes. So we saw a massive improvement in how Google deals with JavaScript content. But now we know it's all because those were new domains, so our 2019 experiment turned out to be wildly successful – and not really useful for any of us, because it was a new domain. Google switches on crawling and rendering for some set period of time – we don't really know how long – just to see whether JavaScript is changing the content. So the new Google actually wins.
And, yeah, great job – that's a massive improvement, across every single experiment. We couldn't create an experiment, even with a massive JavaScript load – really heavy scripts – that would force Google to fail to index the content. We also played with the HTML-to-JavaScript ratio: looking at the Medium example, we figured it might depend on how much of the content is generated by JavaScript. So we had pages where (sorry, this is small print – I didn't think it through) the entire content was injected by JavaScript, pages with just one paragraph injected, and pages with just one word injected – three types of test pages.
All of them were indexed almost within minutes. Most of the content was indexed within half an hour; for five URLs we had to wait a little bit just for Google to crawl them – not index, just crawl – but then they were indexed almost instantly too. After four hours, 29 out of 30 pages were indexed, and after eight hours all the test domains were indexed completely. So this turned out to be a massive win for Google as well: we couldn't force Google to fail at indexing, something that wasn't possible at all two years ago. As I said, this is quite a change, and it's not really visible in the industry. But we don't give up easily, so we figured we'd create one more experiment: we relaunched the 2017 experiment that was massively popular. Long story short – I'm guessing you know where this is going – Google didn't choke on any of the scripts, any of the frameworks, any of the setups: inline, external, doesn't matter. So again, Google won. Martin Splitt was completely right about new websites: this is something they designed well, and it works as designed – for new websites. But what about popular websites? Since we can't really cover those with experiments, we looked at existing domains.
And this is where it got really interesting. Can Google deal with real websites like sixt.de and all the other websites that have a little bit of content generated by JavaScript? This was one of our most complex experiments, and it dragged on for a few weeks – it got a little bit out of control, because we spent way too much time on it after seeing some of the changes. Take National Geographic: you would say this is a JavaScript-powered website, because when you switch off JavaScript, everything disappears – the content is almost completely invisible, there's just a headline. Fun fact: this website has no issues with JavaScript indexing. We used a lot of random samples, and we could never get National Geographic to choke on indexing. This was a first for us: a massive brand without any JavaScript issue. And again – I never know how to pronounce this one. (Offscreen: ASOS!) Thank you so much, I keep forgetting that. ASOS with JavaScript versus ASOS without JavaScript: most of the content is gone without it, and yet, as you can already guess, 100% of the JavaScript content is indexed. This is a massive change as well – something we couldn't see in the wild two years ago. But not every website is that lucky. Enough of the positive examples – let's go into the most interesting zone, the things that don't work, which is usually what SEOs love the most. We took a random sample of a few hundred URLs per site, and this is where things got really interesting: Urban Outfitters – no JavaScript-generated content indexed at all; J. Crew, Topshop, Sephora – 40%; H&M – 73%; and T-Mobile – which is obviously run by German SEOs – did the best, which doesn't surprise me at all.
So that was a random sample. Now let's look into the two waves of indexing: how does that actually work, and what's the timeframe? We figured: okay, we know something is not indexed – how does time affect that? We looked at some interesting domains and measured the percentage of JavaScript content not indexed after 14 days. We would fetch sitemaps, see that a page had just been published, wait until the URL itself was indexed, and from the moment the URL was indexed, measure the time until the JavaScript content was indexed. This is the project that got out of hand – it grew way more than we expected.
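The first step of that pipeline – pulling freshly published URLs out of a sitemap so they can be re-checked over time – can be sketched in a few lines of Python. The sitemap content is inlined here for illustration; in practice you would fetch it over HTTP on a schedule:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/article-1</loc><lastmod>2019-11-01</lastmod></url>
  <url><loc>https://example.com/article-2</loc><lastmod>2019-11-14</lastmod></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fresh_urls(sitemap_xml):
    """Return (url, lastmod) pairs, newest first."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, datetime.strptime(lastmod, "%Y-%m-%d")))
    return sorted(entries, key=lambda e: e[1], reverse=True)

for loc, lastmod in fresh_urls(SITEMAP):
    print(loc, lastmod.date())
```

From there, each URL gets checked repeatedly: first for whether the URL itself is indexed, then for whether a footprinted JavaScript-only snippet from the page is findable – the gap between those two dates is the delay being measured.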
The Guardian has 66% of its JavaScript content not indexed after two weeks – and you would assume a newspaper gets indexed quickly. This is not a tiny bit of the page, either: it's things like the "you might also be interested in" section – a good bunch of the links they use for internal linking to new content – that are not indexed. Target: 30%. The New York Post is actually very good at this. But CNBC… I won't even comment. CNBC had a massive JavaScript issue.
None of these websites is something we would have called a JavaScript-powered website a year or two ago, with one exception. And this is where we as a community kind of failed recently, because we blamed Google for it for quite a while – sometimes that's what we do best, and I was guilty of that as well. Over the last one or two years, most of what SEOs did for JavaScript SEO was just recommend pre-rendering, and that's it.
In this case I feel it gets a little more complicated, because every single JavaScript SEO issue we saw was 100% self-induced. We need to look at ourselves, and at developers and webmasters, to fix those issues, because every single one we worked with wasn't Google's fault – it was basically a design flaw in how we use JavaScript. The JavaScript community is growing very fast as well; this is one of the side effects. Moving forward and talking about the timeline: we're waiting for crawling and indexing to come together. This is something Google is telling us they're going to announce pretty soon-ish, because, as we can see, they're getting better and better at indexing the content. I would say this is happening soon – I wouldn't say this year or next year, but we're waiting for it. When the two waves of indexing come together, in theory, the problem of JavaScript is going to fade away. But it's going to be a while, that's the first thing. And secondly, I wonder how Google is going to deal with pages where they would have to render everything – like Medium, like The Guardian. I wonder how granular it's going to be, and at which point we end up with Google rendering every single page online. So what to do? We had some big news a few days ago, and I want to explain it a little. We created OMFG – Onely Made For Geeks – a free toolset that helps you see some of these problems, because we said: okay, right now there is no way to see whether your website has a JavaScript problem or not. It's still a very early version – we developed it fully within the last three or four weeks – and there's a very good chance it may explode when you play with it. If it does, just email me and I'll apologize. TGIF – The Google Indexing Forecast – is one part of the toolset. Every single day we look at the percentage of JavaScript-powered content that's not indexed, across a manually created dataset of large brands and footprints, and we track how it changes – just for us SEOs to see how close Google is getting to indexing everything properly, and whether anything changes after one or two weeks. We're including any brand: if you send us a page – The Guardian, for example – we find a part of it that's realized in JavaScript and footprint that part, something like the "you may also be interested in…" section, and we fetch the sitemap. It's quite a lot of work to do manually, but the database is now at around 100 to 200 websites – big websites – and for each one we take quite a lot of URLs. It's constantly growing: last time I talked to our research and development team, they were adding tens of pages per day. We footprint everything manually; there's really no way to automate that. You can also compare – if you want to get geeky – HTML delay versus JavaScript delay. "When JavaScript content is not indexed, it's not a JavaScript issue – it's a crawl budget problem."
That's what Google said: if your JavaScript is not getting indexed, look into your crawl budget. So I figured maybe there is some kind of correlation – this is an experiment we're still working on. But we saw that quite a lot of pages struggle to get even their HTML content indexed within days, so this was a very good lead to look into. For example, after two weeks you can see pages with ninety-something percent of the HTML indexed, but only 70 percent of the JavaScript content indexed – the delay is still really big. And after two weeks, HTML indexing for some of the dates would be around eighty percent for very big brands. This is something we had to quadruple-check before publishing, because we couldn't believe it. There's another tool in the toolset that's quite interesting: WWJD – What Would JavaScript Do, obviously. It compares the JavaScript-disabled version of a page with the JavaScript-enabled version to see if there's any difference. On Hulu.com you can see that quite a lot of content disappears, and we can assume that this is the content realized in JavaScript. In this case it's images, but if there's text there, there's a very good chance Google is not going to pick it up – like the comments on Medium. If you look at bbc.co.uk, there's quite a lot of content that relies on JavaScript and disappears without it. There's one more interesting thing we compare: the non-rendered versus rendered versions of the major meta tags, because quite a lot of websites change their canonicals or noindex with JavaScript. And to be honest, even in our own experiments we're never sure which version Google is going to respect, because we serve both or mixed versions. Looking at BBC again – this is a very interesting example – the HTML shows "BBC Home", and after rendering JavaScript it changes to "BBC Homepage". It gets really interesting with canonicals, because after rendering JavaScript, the canonical domain changes. This is the BBC, so you would expect things to go a little better. This may be somehow influenced by how we test it – we still can't believe they would do it like that – but even so, it's something to look into, and these problems are definitely for the BBC to fix. (I have arrows – fancy arrows, I'm sorry. These are animations I added myself; you can see how splashy they are. Oh, three of them! I got kind of crazy.) You can also see one more very interesting thing: links added by JavaScript. If you crawl your website with and without JavaScript – with Ryte or DeepCrawl or the crawler of your choice – you will see two different datasets, with different link graphs for the two versions. Which one matters in the end? That's something we can't agree on internally in the office; one of the easiest ways to pick a fight at Onely is to ask that question. There are also links removed by JavaScript, which is even more interesting, because at that point it gets really confusing – and if we look at the BBC, this problem is real. Finally, Too Long; Didn't Render – TL;DR – is the last part of our amazing toolset, where you can see the cost of rendering your page, based on CPU and memory. In the case of Sixt, you're very good. There was one clear winner in our tests – I had to run it a few times just to get into the green zone – and, okay, BBC kind of went crazy with how they do things. So why do you need this?
I should lead with that. You need it because – and this is a topic I've spoken about quite a lot of times already – if your users have cheaper mobile devices, a page like the BBC's is going to choke on them. Motorola's Moto G4, cheaper Android devices, older iPhones are not going to deal well with such a rendering load. And there is one page that is still our number one: SEOktoberfest, the page made by Marcus [Tandler]. There's zero CSS, and the score is 2. This is the most amazing score we've seen, because the cost of rendering is almost zero – and that almost comes entirely from one image. I guess if you removed that image, it would go down to 1, so maybe you should develop backwards. Anyhow, this is what I wanted to share with you – a use case of how you can play with our toolset. I recorded a video to show the difference between the rendered and non-rendered sixt.de. With JavaScript disabled, quite a lot of content disappears and the website is much smaller; with JavaScript enabled, all that content appears out of nowhere. And this becomes problematic when you pick any of this content, Google for it, and try to find the website. I asked Thurston before coming on stage – I didn't want to be one of those guys who gets invited to an event and then just makes fun of the website. But most of the content on sixt.de that relies on JavaScript – if you Google for it, you won't find sixt.de ranking for it. It is indexed: if you use the site: command in Google search, you will find it. But, as we see quite often, if your content relies on rendering, it is very easy for someone else to outrank you for your own content. So, yeah, that's more or less it.
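If you want to run that check on your own pages, the recipe is simple: pick a sentence that only exists in the rendered version of a page, then search for it as an exact phrase, with and without a site: restriction. A tiny hypothetical helper (the function name is mine) to build the two queries for manual checking:

```python
def index_check_queries(js_only_sentence: str, domain: str):
    """Build the two Google queries for checking whether a
    JavaScript-reliant sentence from a page is indexed and ranking."""
    phrase = f'"{js_only_sentence.strip()}"'
    return {
        # Who ranks for this phrase? If another site does, you are being
        # outranked for your own JS-reliant content.
        "web": phrase,
        # Is the phrase indexed on your own domain at all?
        "site": f"{phrase} site:{domain}",
    }

queries = index_check_queries("Rent a car at the best price", "sixt.de")
print(queries["web"])
print(queries["site"])
```

If the `site:` query finds the page but the plain phrase query ranks someone else, you are seeing exactly the sixt.de situation: indexed, but beaten on your own content. (The sample sentence above is made up, not taken from sixt.de.)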
On top of that, there are links added by JavaScript. This is a massive pain for your technical SEO team, because if links are added by JavaScript, your internal link graph is going to change. (I can see Tomas nodding – "yeah, we had that problem already.") So this is just a use case of how you can approach it.
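The link-graph and canonical comparison can be sketched with nothing but the standard library, assuming you've already captured the raw and the rendered HTML of a page. This is a toy version of what a crawler-based comparison would do, with made-up sample markup:

```python
from html.parser import HTMLParser

class LinkAndMetaParser(HTMLParser):
    """Collects <a href> targets and the rel=canonical URL from a page."""
    def __init__(self):
        super().__init__()
        self.links = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.add(attrs["href"])
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

def parse(html):
    parser = LinkAndMetaParser()
    parser.feed(html)
    return parser

raw = parse('<head><link rel="canonical" href="https://a.example/"></head>'
            '<body><a href="/one">1</a></body>')
rendered = parse('<head><link rel="canonical" href="https://b.example/"></head>'
                 '<body><a href="/one">1</a><a href="/two">2</a></body>')

print("links added by JS:", rendered.links - raw.links)           # {'/two'}
print("links removed by JS:", raw.links - rendered.links)         # set()
print("canonical changed:", raw.canonical != rendered.canonical)  # True
```

Running this over a whole crawl gives you the two link graphs mentioned above – and flags pages like the BBC example, where rendering changes the canonical.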
You can see this problem surface in the mobile-friendly tester as well. There are quite a lot of tools we're still building towards launch, but you can already see that this is quite useful for playing with your own domains. It's completely free, and – just to answer some of the Twitter questions – we're not going to gather your website's data and pitch you. We actually never pitch; we don't even have a sales team. But people were very afraid of what would happen if they played with it. And yeah, people enter all kinds of weird domains into these tools – I wouldn't want to name them.
And now let's step away from JavaScript for a second and talk about HTML – let's go old-school for a second. How quickly does Google index HTML content from The Guardian? Looking at 1,300 URLs, Google indexed 98% of The Guardian's HTML content. Pretty decent – I wouldn't complain. But it doesn't look as good for other brands. The Guardian is fairly good – and again, this is just HTML, just whether the URLs get indexed. If we look at Target, for example, you would expect a fairly big e-commerce website in the States to do very well, yet after two weeks they only get to 80% – two weeks from publishing products and content. So the problem we're seeing is much bigger than just JavaScript SEO. Eventbrite has 55 or 56 percent of their content indexed after two weeks. You can imagine how much of a problem that is: they'll optimize those pages and ask, "okay, why don't we rank well?", when actually half of their pages – half of their domain – are not indexed. So HTML seems to be very problematic too. And now, the second time I'm testing this amazing joke (I think Philip saw it already, and so did our Head of Research and Development, Tomek): my final example is Medium, and Medium is medium at indexing. That's the joke. Thank you.
With a quick check of 100 URLs from Medium – a random check, not after any timeframe – only 70% of them are indexed in Google. So Medium has massive issues indexing their content, something you wouldn't expect from a content platform. And of the content that is indexed, only 50% has its JavaScript content indexed. (Which is kind of geeky: that's the Medium value. That's the end of the joke – I tried it a second time and, okay, it's funny in Munich. Noted.) You can see that this is a problem because, again, this is not a JavaScript problem – it's basically a crawl budget and HTML issue. And just one last slide: an example I've used quite a few times at conferences to show that this is dangerous – something to either play with or avoid, depending on which side of SEO you're on. We created quite a lot of pages with content that's sensitive to a lot of people – gun control, Trump versus Hillary, or Peppa Pig, if you saw the Peppa Pig drama with some of the violent content. We basically built a website that shows two completely different stories depending on whether you switch JavaScript on or off, and Google couldn't pick it up for, I think, a year. You can actually visit the page. You can cloak anything with just JavaScript, and this is extremely dangerous: Google and all the other search engines see the HTML saying gun control should happen, but once you render the JavaScript, it replaces not only the header but the whole content with "no gun control". We've seen examples of this in the wild, done for some big brands in the US – in the car industry as well, though not currently, unfortunately. A lot of people are playing with this to inject content for search engines but not for users: what we actually saw was a listing page where users see just a photo of, say, a car with a short description, while Google sees a massive spreadsheet of data – and this still works very well. It's something Google can't fix, I'm guessing, for technological reasons – or maybe the scale isn't big enough for them to worry about. I showed it to Martin Splitt as well, so maybe they will somehow address it, but this is a massive issue. So again: depending on which side of SEO you're on, you'll either avoid it or play with it. More data and more tools are coming soon, so stay tuned. And thank you so much.
