Lately there have been a lot of complaints about nss_ldap-249+ breaking systems on boot. The source of this is not actually a breakage, but a change in behavior that exposed something that was always broken. Many of the remarks below apply to any NSS backend whose actual data source might not be available during the early phases of booting (because the LDAP server may not have started yet, or the network may not be up).
In your /etc/nsswitch.conf file, you may have lines like:
passwd: files ldap
group: files ldap
If you have it the other way around (ldap before files), that's the first cause of breakage. The always-on sources need to come first, because they are the only ones guaranteed to be available at system boot time.
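The ordering rule above can be checked mechanically. The following is a minimal sketch, not part of glibc or nss_ldap: the helper names are hypothetical, and treating `compat` as an always-on source alongside `files` is an assumption you may want to tighten.

```python
import re

# Assumption: these are the sources that are always available at boot.
ALWAYS_ON = {"files", "compat"}

def parse_line(line):
    """Return (database, [sources]) for one nsswitch.conf line, or None."""
    line = line.split("#", 1)[0].strip()   # drop comments
    if ":" not in line:
        return None
    database, rest = line.split(":", 1)
    # Drop bracketed action items such as [NOTFOUND=return]
    rest = re.sub(r"\[[^\]]*\]", " ", rest)
    return database.strip(), rest.split()

def misordered(text):
    """List the databases whose first source is not an always-on one."""
    bad = []
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        database, sources = parsed
        if sources and sources[0] not in ALWAYS_ON:
            bad.append(database)
    return bad

if __name__ == "__main__":
    with open("/etc/nsswitch.conf") as handle:
        print(misordered(handle.read()))
```

An empty list means every database consults a local source first; any name printed is a database that goes to a remote backend before the local files.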
During boot, nearly every init script causes at least one lookup. Some things, like udev, cause a lot of lookups, because they need the results. If a lookup can be answered entirely from the files nss backend, then it doesn't need to go to LDAP (or any other unavailable backend). In the case of udev, for a very long time there has been this rule:
/etc/udev/rules.d/50-udev.rules:KERNEL=="tpm*", NAME="%k", OWNER="tss", GROUP="tss", MODE="0600"
This causes udev to look up the user and group 'tss' (that's two lookups). Does your system have a 'tss' user and group? Unless you have the app-crypt/trousers package installed, you probably don't.
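To find out which names a rule will ask for, and whether the local files can answer them, something like the following may help. It is a rough sketch under assumptions: the function names are made up for illustration, the regular expressions only handle the simple quoted `OWNER="..."` and `GROUP="..."` assignments seen in the rule above, and the files are read directly rather than through `pwd`/`grp` so that the check itself never goes through NSS.

```python
import re

def rule_owners(rule):
    """Extract the OWNER= and GROUP= values assigned by one udev rule line."""
    owner = re.search(r'\bOWNER="([^"]*)"', rule)
    group = re.search(r'\bGROUP="([^"]*)"', rule)
    return (owner.group(1) if owner else None,
            group.group(1) if group else None)

def local_names(file_text):
    """Names in the first column of /etc/passwd- or /etc/group-style text."""
    return {line.split(":", 1)[0]
            for line in file_text.splitlines()
            if line and not line.startswith("#")}

def missing_locally(rule, passwd_text, group_text):
    """Return the (database, name) pairs the local files cannot answer."""
    owner, group = rule_owners(rule)
    missing = []
    if owner is not None and owner not in local_names(passwd_text):
        missing.append(("passwd", owner))
    if group is not None and group not in local_names(group_text):
        missing.append(("group", group))
    return missing

if __name__ == "__main__":
    rule = 'KERNEL=="tpm*", NAME="%k", OWNER="tss", GROUP="tss", MODE="0600"'
    with open("/etc/passwd") as p, open("/etc/group") as g:
        print(missing_locally(rule, p.read(), g.read()))
```

Every pair it reports is a lookup that would fall through to the next backend in nsswitch.conf.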
OK, so if this has always been a problem, why did it suddenly turn up now? nss_ldap-249 has a change of behavior (badly documented by upstream, unfortunately). It changed from hardcoded timeout values to configurable ones, and greatly increased the defaults. Previously, if the server was not available or otherwise had issues, nss_ldap failed out after at most 30 seconds (and a lot less if the server IP/port were actually unreachable). As of 249, it takes 124 seconds. It tries twice, then waits 4 seconds, then another 8 seconds, another 16 seconds, another 32 seconds, and finally another 64 seconds, with an attempt between each of the waits. Unfortunately this behavior is serial, and happens for every lookup: udev tries to look up user 'tss', then group 'tss', etc. On some systems, this made boot unbearably slow, as there were 30+ lookups that went to nss_ldap, at 2 minutes each, leading to an hour of waiting before the actual login prompt came up.
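The arithmetic behind those numbers can be sketched as follows. This is only a model of the waits described above: the function and parameter names are hypothetical (they are not nss_ldap options), and the time spent on the connection attempts themselves is ignored.

```python
def backoff_waits(initial=4, cap=64):
    """Sleep intervals in seconds: start at `initial`, double until `cap`."""
    waits = []
    wait = initial
    while wait <= cap:
        waits.append(wait)
        wait *= 2
    return waits

def per_lookup_seconds(initial=4, cap=64):
    """Total time one failing lookup spends sleeping."""
    return sum(backoff_waits(initial, cap))

def boot_delay_minutes(lookups, initial=4, cap=64):
    """Serial lookups each pay the full backoff, so the delays add up."""
    return lookups * per_lookup_seconds(initial, cap) / 60

if __name__ == "__main__":
    print(backoff_waits())          # [4, 8, 16, 32, 64]
    print(per_lookup_seconds())     # 124
    print(boot_delay_minutes(30))   # 62.0
```

Thirty failing lookups at 124 seconds each come to 62 minutes, which matches the hour-long boots people reported.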
How do we fix this?
The proper way: For every Gentoo init script, we need to make sure that every value looked up is actually in the system files, so that no requests go to nss_ldap or any other remote backend. In the case of udev, this is a known flaw of udev, that it looks up stuff it doesn't need to. If somebody has enough time to look at the udev code, upstream would greatly appreciate it - they don't have enough time to do it. You can comment out the tss line temporarily as well if you want.
The temporary hack: I've committed nss_ldap-250-r1, which changes the default timeouts in the header files and documents them in /etc/ldap.conf, along with the old values and some even faster (read: more dangerous) ones.
Side note: It does seem there is something that changed with regards to SSL behaviour in either openldap-2.3.* or nss_ldap between 239 and 249. In some setups, 'ssl on' no longer works, but specifying a plain ldap:// URL instead of ldaps://, and using 'ssl start_tls' works perfectly fine. If you run into this, move to TLS!
configuration workaround
Date: 2006-06-20 08:59 pm (UTC)
For my part, I corrected this problem with this hack in the config file. I found it was an undocumented feature of the new nss_ldap, and it can be used in ldap.conf:
nss_reconnect_tries 2
With bind_policy hard it still takes time to boot, but not as much (it tries only 2 times, instead of ... nearly forever). I also recently found that you can change the timeout period with another undocumented parameter. I haven't had the time to test it yet, but I think we can use it as a workaround?
bye bye
mRyOuNg
Re: configuration workaround
Date: 2006-06-20 10:24 pm (UTC)
Specifically, I said:
If you emerge 250-r1, you will see that nss_reconnect_tries is one of those settings that I have changed. From my patch:
Re: configuration workaround
Date: 2006-07-09 10:47 am (UTC)
When is this ebuild planned to be in the x86 tree?
bye bye
mRyOuNg
Something to note
Date: 2006-06-29 06:41 am (UTC)
In /etc/nsswitch.conf I *temporarily* changed
passwd: files ldap
shadow: files ldap
group: files ldap
to
passwd: compat
shadow: compat
group: compat
and then portage worked fine. I was booted to the live cd and chrooted, if that makes any difference at all. Then I switched my nsswitch.conf back to
passwd: files ldap
shadow: files ldap
group: files ldap
The system boots much better now.
Re: Something to note
Date: 2007-03-22 06:48 pm (UTC)
info libc "NSS Configuration File" "Notes on NSS Configuration File"
sleazy way of doing it
Date: 2006-07-20 01:05 am (UTC)
Re: sleazy way of doing it
Date: 2006-08-03 12:01 am (UTC)
# adduser tss
Re: sleazy way of doing it
Date: 2006-08-03 02:05 am (UTC)
glib bug?
Date: 2006-07-27 08:05 am (UTC)
passwd: files ldap
shadow: files ldap
group: files ldap
in /etc/passwd
ldap:x:439:439:added by portage for openldap:/usr/lib/openldap:/bin/false
I use a command
/usr/lib/openldap/slapd -u ldap
and nss tries to connect to ldap
Re: glib bug?
Date: 2006-07-27 08:12 am (UTC)
mask versions over 239?
Date: 2006-12-14 07:03 pm (UTC)
1) Setting "bind_policy soft" DOES NOT WORK. It appears to work, as things like "getent passwd" return immediately (instead of blocking) after dumping /etc/passwd, but if you try to ssh in as an ldap user, sshd will bail out with:
nss_ldap: could not search LDAP server - Server is unavailable
fatal: login_get_lastlog: Cannot find account for uid 1000
2) Putting the user in /etc/passwd DOES NOT WORK. nss_ldap will go through the passwd file first, and will indeed find the user, but it will still attempt to do the ldap lookup. Once that fails, it will return the info it found in /etc/passwd. There needs to be a way to short-circuit this (see #4 below).
3) Your comments about init scripts are spot on, and affect /etc/init.d/slapd as well. If your ldap server is using ldap auth, restarting slapd will take the entire timeout period (about 3 minutes).
4) I tried messing with nsswitch.conf:
passwd: files [SUCCESS=return] ldap
shadow: files [SUCCESS=return] ldap
group: files [SUCCESS=return] ldap
This seems like the "right" way to fix this problem, but it does not appear to do anything.
IMHO, v249 should be masked. I've got servers running v239 that have run fine for months and handle this in a more sane way (if ldap isn't available, the lookup fails immediately instead of blocking). I just noticed this problem when setting up a new ldap infrastructure. Should I just revert to v239, or is this problem fixed in newer versions?
Re: mask versions over 239?
Date: 2006-12-16 03:47 am (UTC)
- Try all combinations of the ssl setting in your /etc/ldap.conf.
In response to specific points above:
#2 - this indicates that something ELSE was being looked up in ldap. Turn on debugging to trace what it was.
#4 - no, [SUCCESS=return] is the default behavior.
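To see why adding [SUCCESS=return] changes nothing, here is a toy model of how an nsswitch.conf chain is walked. It is a simplified sketch, not glibc code: the function names and the backend stubs are invented for illustration, and only the default actions (return on SUCCESS, continue on NOTFOUND, UNAVAIL and TRYAGAIN) are taken from the glibc documentation.

```python
# Default action for each lookup status, per the glibc manual.
DEFAULT_ACTIONS = {
    "SUCCESS": "return",
    "NOTFOUND": "continue",
    "UNAVAIL": "continue",
    "TRYAGAIN": "continue",
}

def files_backend(table):
    """Build a stand-in for the files backend from a name -> value dict."""
    def backend(key):
        if key in table:
            return "SUCCESS", table[key]
        return "NOTFOUND", None
    return backend

def unavailable_backend(key):
    """Stand-in for an LDAP backend whose server cannot be reached."""
    return "UNAVAIL", None

def lookup(chain, key):
    """Walk (name, backend, overrides) entries; return (status, value, consulted)."""
    status, value, consulted = "NOTFOUND", None, []
    for name, backend, overrides in chain:
        status, value = backend(key)
        consulted.append(name)
        action = {**DEFAULT_ACTIONS, **overrides}[status]
        if action == "return":
            break
    return status, value, consulted

if __name__ == "__main__":
    files = files_backend({"root": 0})
    plain = [("files", files, {}), ("ldap", unavailable_backend, {})]
    explicit = [("files", files, {"SUCCESS": "return"}),
                ("ldap", unavailable_backend, {})]
    print(lookup(plain, "root"))     # stops at files
    print(lookup(explicit, "root"))  # identical result
    print(lookup(plain, "tss"))      # falls through to ldap
```

Both chains stop at files for a name that exists locally. Only a name that files cannot answer reaches ldap, with or without the explicit action.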
The source of #2/#3 is that coreutils and other things occasionally attempt to look up a numeric uid/gid as a string! (That is, as a name matched against the first column of /etc/{passwd,group} and the corresponding field in LDAP.) This ALWAYS fails; it's just that LDAP failures take a long time.