#DRAFT

Introduction to Squid's “StoreID” feature – by Eliezer Croitoru

- Squid developer and Linux System administrator

►This is not another tutorial

This is not another tutorial about installing and configuring Squid.

It's more about learning some HTTP internals.

If you are not familiar with these, you will probably pick them up on the fly.

Questions are always welcome, no matter where you are or what your name is!

This is a hacker's point of view on HTTP rather than an admin's.

What does Squid actually do?

- Squid does cache, not store.
Like many other cache mechanisms that work at lower levels in order to give upper levels faster traffic, Squid caches HTTP and some other protocols.
There is a difference between a cache and a store.
A store is supposed to hold a persistent copy of the data, while a cache is supposed to hold data for a specific period of time.
Every store and cache has a time limit, but most of the mediums used for storing hold data for a longer period of time, which is considered persistent.

 

- Squid is based on the idea that each HTTP request has only one response in the world. This is what allows "caching": mapping HTTP requests to stored HTTP responses. It sounds pretty simple, and it really is that simple.
Say I want a cookie from the store: I just need to tell the guy in the store "I need cookies". He can ask which manufacturer, but for most kids he will hand over the "candy" that most kids would take.
The idea behind a URL is to be a Uniform Resource Locator.
Like one school uniform fits many students, the same goes for content on the internet: there is one identifier for one piece of content, which can be served to a lot of clients.
When a request passes through Squid, it analyzes the request and decides whether the response can be cached at all.
In cases where the response can be related to a specific request, Squid tries to find out some things about the response's "cachiness", which is very simple to force using something called refresh_pattern.
When Squid finds a way to identify and map content that is cached, it makes sure that the cached content is still valid and is a reliable source for a response.
All the above is essentially Squid's logic in a couple of sentences.
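The decision chain above can be sketched as toy Ruby code. This is not Squid's actual logic: the method check, the header check, and the status list below are simplified assumptions for illustration only.

```ruby
# Toy sketch of the decision chain described above -- not Squid's real code.
# Assumptions: only GET is considered, and only a small subset of the
# status codes that are cacheable by default.

def cacheable?(method, status, response_headers)
  return false unless method == "GET"                    # simplest cacheable case
  return false if response_headers["Cache-Control"].to_s.include?("no-store")
  [200, 203, 300, 301, 410].include?(status)             # partial default-cacheable list
end

# A refresh_pattern-style freshness check: a cached copy is served only
# while its age is below the configured maximum.
def fresh?(age_seconds, max_age_seconds)
  age_seconds <= max_age_seconds
end

cacheable?("GET", 200, {})                                # cacheable
cacheable?("GET", 200, "Cache-Control" => "no-store")     # not cacheable
```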

What are the great powers of Squid vs. other solutions, even commercial ones? (No, we don't usually do comparisons, but... we will.)
- The great power of Squid is the people who work on the code, and the many admins who use the product in production, since it's easy and simple to use and implement.
- Thousands of users around the world support the development cycle and the testing of new and advanced features, despite the limited resources the project has.

- There are great commercial solutions out there, and Squid pushes the limits forward, which makes it a great alternative to the commercial ones (not to mention forcing higher product quality).

What algorithms/ideas are used in order to cache HTTP objects?

-   lru       : Squid's original list based LRU(least recently used) policy

-   heap GDSF : Greedy-Dual Size Frequency

-   heap LFUDA: Least Frequently Used with Dynamic Aging

-   heap LRU  : LRU(least recently used) policy implemented using a heap

There are a couple of algorithms integrated in Squid that decide how and when Squid will remove an object from the cache.
Today we are talking about cache sizes that exceed the need to remove many objects from the cache.
Say 40% of the traffic is cacheable and the traffic averages 1 Gbps; a 300GB cache will be enough to satisfy the clients.
There is a big difference between consuming no bandwidth at all and consuming the average usage.
The fact that most caches use LRU and still succeed in their mission more than average is proof enough.
There are more complex algorithms out there that can get better results, but they are environment-specific and not as general as the policies implemented inside Squid.
LRU has worked for CPUs and other software, commercial and otherwise, for a very long time, and this is the reason the LRU policy is the default in Squid.
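As a sketch of the idea behind the default policy, here is a minimal list-based LRU cache in Ruby. This is a toy, not Squid's implementation; it leans on the fact that Ruby hashes preserve insertion order.

```ruby
# Minimal list-based LRU sketch: the least recently used entry is
# evicted first, as in Squid's default "lru" replacement policy.
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}  # Ruby hashes preserve insertion order
  end

  # Reading an entry refreshes it: delete-and-reinsert moves it to the
  # most-recently-used (last) position.
  def get(key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key)
  end

  # Writing evicts from the front of the hash -- the least recently used.
  def put(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.keys.first) while @store.size > @capacity
  end
end

cache = LRUCache.new(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so it becomes most recently used
cache.put("c", 3)  # evicts "b", the least recently used entry
```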

- The basic identification of each request and response is the URL.

- There are requests which are unique, either by the request headers themselves or by the response headers.

- By default, Squid holds in memory the HTTP objects that are allowed and wanted to be cached.

- Squid has the refresh_pattern directive, which can force longer caching than the author of the site planned or designed, in cases where the site is not cache-friendly or for other reasons.

- Caching torrents based on Squid traffic.
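As an illustrative refresh_pattern sketch (the regex and all the numbers are made up for the example; the numeric fields are minimum age in minutes, a percentage of the object's age, and maximum age in minutes):

```
# Illustrative only: allow images to be cached for up to a week
# (10080 minutes), overriding some origin freshness hints.
# These options bend strict HTTP semantics, so use them with care.
refresh_pattern -i \.(jpg|jpeg|png|gif)$ 1440 80% 10080 override-expire ignore-no-store
```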

A bit about the tools we have in Squid (debug_options) that can help us identify the cachability of an HTTP object.
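For example, a hedged debug_options sketch (the section number is an assumption; check which debug section covers refresh/cacheability calculation in your Squid version):

```
# Keep general logging quiet (level 1) but raise verbosity for the
# refresh-calculation section (assumed to be section 22 here).
debug_options ALL,1 22,3
```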

What and how is StoreID related to all of the above, and what does it do?

- How Squid relates a request to a response.

- Are there any alternatives? What do they cost?

- How powerful is StoreID for real?

- StoreID actually only gives a small upgrade to what already exists.

- Hash calculation of a URL/object (the case of Vary).

 

A simplified Ruby illustration of how such a cache key can be built: an MD5 hash over the request method and the URL (using the bindata gem).

require 'digest'
require 'bindata'

# The binary layout of a (simplified) cache key: method number + URL.
class StoreID < BinData::Record
  endian :little
  uint8  :method_num
  string :url
end

h = StoreID.new
h.method_num = 1                            # method number (e.g. GET)
h.url = "http://www.ngtech.co.il/302.html"
Digest::MD5.hexdigest h.to_binary_s

=> "23bdb1f4e1dc229585944adafd0c2bb0"

- What else does Squid give you besides a HIT? A LOT!!

- A comparison of store_url_rewrite vs. StoreID.

- A couple of real-world scenarios, like YouTube and others, on the fly. (- why Ruby?)
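A scenario like YouTube can be sketched as a StoreID helper in Ruby. This is a hedged sketch of the store_id_program interface, assuming helper concurrency is enabled (each input line is "<channel-ID> <URL> ..."); the googlevideo pattern and the "id"/"itag" parameters are illustrative assumptions, since real videoplayback URLs change over time.

```ruby
#!/usr/bin/env ruby
# Sketch of a StoreID helper: answer "OK store-id=<key>" to collapse many
# per-server URLs into one cache key, or "ERR" to keep the original URL.

def store_id_for(url)
  # Collapse per-server videoplayback URLs into one stable cache key,
  # assuming "id" and "itag" identify the content (illustrative only).
  if url =~ %r{\Ahttps?://[^/]+\.googlevideo\.com/videoplayback\?(.*)\z}
    params = $1.split('&').map { |kv| kv.split('=', 2) }.to_h
    if params['id'] && params['itag']
      return "http://video-srv.youtube.squid.internal/id=#{params['id']}&itag=#{params['itag']}"
    end
  end
  nil  # ERR: let Squid use the original URL as the key
end

if $PROGRAM_NAME == __FILE__
  STDOUT.sync = true  # Squid expects unbuffered answers
  while (line = STDIN.gets)
    channel, rest = line.chomp.split(' ', 2)
    id = rest && store_id_for(rest.split(' ').first)
    puts id ? "#{channel} OK store-id=#{id}" : "#{channel} ERR"
  end
end
```

The `.squid.internal` suffix keeps the rewritten key in a namespace that can never clash with a real, routable URL.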

- Bugs that existed in store_url_rewrite:

Bug 2248 - caused a problem while unpacking metadata from an object on disk.

- Since store_url_rewrite was designed to answer only specific needs, without considering other things that were addressed later in newer versions of the 3.x branch, the above bug was introduced.

- StoreID does two things about it: 1. a design based on more knowledge; 2. far more tested code and a better understanding of Squid internals.

Bug 2678 - rewriting doesn't work on resources where a Vary header exists.

- Since the StoreID design took into consideration that Vary headers change the cached object's MD5 hash, it doesn't even get the chance to touch anything Vary-related.

Bug 2691 - TCP_SWAPFAIL_MISS and a memory leak.

- While testing the StoreID ideas I found out what the reasons for a TCP_SWAPFAIL_MISS can be, and therefore all the related issues that could have come up were prevented from happening while running Squid.

- The above bug could only happen in the specific case of swapping to and from a cache_dir, not in memory. The reason is that when the cache object is in memory you can alter many things that an on-disk object's tests will fail on while swapping from disk into memory or to a client.

302 YouTube responses are being cached

- The above is not really a bug: what YouTube's systems (or caching systems) do is redirect from one server to another, from a far server to a closer one.

- Squid doesn't have any way nowadays to say "don't cache 302 responses from this domain".

- The only valid way is to use an ICAP service and set the Cache-Control header to "no-store", which YouTube will never use for something they want the browser to cache.

Adrian Chadd on Google Maps caching

** What is the bug and how was it resolved?

- Bugs that do not exist in StoreID, and why they do not exist.

What bugs do exist when using StoreID, and why do they happen?